Abstract Activity recognition has shown impressive progress in recent years.
However, the challenges of detecting fine-grained activities and understanding
how they are combined into composite activities have been largely overlooked.
In this work we approach both tasks and present a dataset which provides
detailed annotations to address them. The first challenge is to detect
fine-grained activities, which are defined by low inter-class variability and
are typically characterized by fine-grained body motions. We explore how human
pose and hands can help to approach this challenge by comparing two pose-based
and two hand-centric features with state-of-the-art holistic features. To
attack the second challenge, recognizing composite activities, we leverage the
fact that these activities are compositional and that the essential components
of the activities can be obtained from textual descriptions or scripts.

We show the benefits of our hand-centric approach for fine-grained activity
classification and detection. For composite activity recognition we find that
decomposition into attributes allows sharing information across composites and
is essential to attack this hard task. Using script data we can recognize
novel composites without having training data for them.
1 Introduction


Human activity recognition in video is a fundamental problem in computer
vision. State-of-the-art methods (e.g. Tang et al. 2012; Wang et al. 2013b;
Wang and Schmid 2013; Karpathy et al. 2014) achieve near perfect results for
simple actions (e.g. KTH dataset, Schuldt et al. 2004) and robustly recognize
actions in realistic settings such as Hollywood movies (Marszalek et al.
2009), videos from YouTube (Liu et al. 2009), or sport scenes (Rodriguez et
al. 2008).

While impressive progress has been made, we argue that most works address only
a part of the overall activity recognition challenge. Many application
scenarios, such as human-robot interaction or elderly care, require
understanding complex activities (e.g. does the person prepare food?),
consisting of multiple fine-grained activities and object manipulations (e.g.
is it fried and what is in it?). Frequently it is important to recognize both
the individual steps and the high level composite activities, e.g. as we have
shown for the task of video description (Rohrbach et al. 2014). Consequently
we approach both problems in this work: recognizing fine-grained activities
and recognizing composite activities. Fine-grained activities are defined as a
set of activities which are visually very similar, i.e. have a low inter-class
variability. Composite activities are activities which can be temporally
decomposed into multiple shorter activities, i.e. they consist of multiple
steps. We note that the two terms are not mutually exclusive, i.e. composite
activities can also be fine-grained. In fact some of our composites are very
similar. However, in our work we consider composite activities which consist
of fine-grained activities.

Figure 1 Sharing or transferring attributes of composite activities using
script data. Composite activities (gray boxes) are composed of activities and
their participants (light-blue boxes), modeled as attributes. These attributes
can be transferred to unseen composite activities (dashed-line box) with the
help of script data which allows estimating the relevant attributes (red). Our
activities have the additional challenge of being fine-grained; we thus refer
to them as fine-grained activities.

When surveying the field we also noticed a lack of datasets that allow
pursuing the challenges of fine-grained and composite activity recognition.
Specifically this is reflected in the following limiting factors of current
benchmark databases. First, while datasets with large numbers of activities
exist, the typical inter-class variability is high. This seems rather
unrealistic for many domains such as surveillance or elderly care where we
need to differentiate between consequentially different but visually similar
activities, e.g. hug someone versus hold someone or throw in garbage versus
put in drawer. Second, the activities considered so far are full-body
activities, e.g. jumping or running. This appears rather untypical for many
applications where we want to differentiate between activities with smaller
motions that are frequently hand-centric. Consider e.g. the cutting activity
in domains such as cooking (see Figure 1), handicraft work or surgeries, as
well as different repairing activities in the domain of housekeeping or
machine maintenance with subtle differences in motion and low inter-class
variability. As a third limitation we found that many available databases
contain videos that are only a few seconds long and focus on simple
basic-level activities such as walking or drinking. In contrast, the
recognition of longer-term, complex, and composite activities such as
assembling furniture, food preparation, or surgeries has rarely been addressed
in computer vision. Notable exceptions exist (see Section 2) even though these
have other limiting factors such as a small number of classes.


In this work, which is an extension of our original publications (Rohrbach et
al. 2012a) and (Rohrbach et al. 2012b), we recorded, annotated, and publicly
released a large-scale dataset in a kitchen scenario which addresses the
discussed limitations. This allows us to work on the challenges of
fine-grained and composite activity recognition as follows.


Recognizing fine-grained activities is challenging due to their low
inter-class variability. In contrast to fine-grained object recognition
challenges where the same object category typically is also visually
consistent, activities of the same category are frequently very diverse, i.e.
have a high intra-class variability. Consider e.g. the activity peeling, which
can be very different depending on the participating object: peeling a carrot
versus peeling a pineapple. At the same time, we have to handle small
differences between categories, i.e. low inter-class variability, consider
e.g. mix versus stir or slice versus cut dice. This typically requires
understanding the difference between fine-grained body motions. To approach
both of these challenges we propose to focus on body pose and hands. As can be
seen in Figures 1 and 2 many fine-grained activities, especially in our
kitchen scenario, are hand-centric. Here it is not only important to
understand the activity but also the participating object, e.g. open egg
versus open tin. We thus propose to focus on the hand regions for extracting
visual features. However, hand detection is a challenging problem in itself in
real-world scenarios due to a large variability in shape and frequent partial
occlusions (Mittal et al. 2011; Gkioxari et al. 2013). To get reliable hand
detections, we integrate a hand detector into an articulated pose estimation.
Consequently we use the hand position to extract color SIFT and Dense
Trajectories (Wang et al. 2013a) and learn detectors for fine-grained
activities and their participating objects. Recently, Jhuang et al. (2013)
showed that exploiting body pose in the form of body joints can be beneficial
for full-body activities. We explore two approaches based on body pose tracks,
motivated from work in the sensor-based activity recognition community (Zinnen
et al. 2009).
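
The hand-centric idea can be illustrated with a minimal sketch: given an estimated hand position in a frame, keep only those local features that fall inside a square window around the hand. The function name, the `(x, y, descriptor)` tuple layout, and the window half-width of 60 pixels are assumptions made for illustration only; they do not describe the actual feature extraction used in this work.

```python
def features_near_hand(features, hand_xy, half_width=60):
    """Keep local features that lie in a square window around the hand.

    features   -- list of (x, y, descriptor) tuples (hypothetical layout)
    hand_xy    -- (x, y) hand position, e.g. from a pose estimate
    half_width -- half the window side length in pixels (assumed value)
    """
    hx, hy = hand_xy
    return [
        (x, y, d)
        for (x, y, d) in features
        if abs(x - hx) <= half_width and abs(y - hy) <= half_width
    ]


if __name__ == "__main__":
    feats = [(100, 100, "a"), (150, 120, "b"), (400, 300, "c")]
    # Only the two features close to the hand at (120, 110) are kept.
    print(features_near_hand(feats, hand_xy=(120, 110)))
```

In such a scheme the window size trades off context (the manipulated object) against background clutter, so it would have to be chosen on validation data.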

For recognizing composite activities, state-of-the-art methods, which build on
discriminative learning from low-level activity features, experience
scalability issues due to the typically highly diverse composite activities
and little training data. A promising approach towards scaling activity
recognition methods to a large number of complex activities is to use
intermediate representations that are shared and transferred across activities
by exploiting their compositional nature. We exploit this technique and
propose building on an attribute-based representation, with attributes
denoting the fine-grained activities and the participating objects. For
example in Figure 1 the composite activity preparing scrambled egg shares the
attributes stir and spatula with the composite activity preparing onion and
the attributes open and egg with the composite activity separating egg.
Instead of learning a holistic model for each composite activity we learn
models for a large set of attributes shared across composite activity classes.
Such approaches have been shown to be effective for recognizing previously
unseen object categories (Lampert et al. 2013) and have also been applied to
activity recognition (Liu et al. 2011). A major challenge in recognizing
everyday activities is that these composite activities can often be performed
in a wide variety of ways, and it is practically infeasible to create a
visually annotated training set with all possible alternatives. Instead, we
collect a large number of textual descriptions (scripts) for a composite
activity to compute the association strength between attributes and composite
activities. Using this script data we can not only handle the inherent
variation of composites but also recognize unseen composite activities. As
illustrated in Figure 1 the attributes in red are determined to be important
for preparing scrambled eggs using script data and can be transferred from
known composites such as separating egg and preparing onion.
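
The transfer idea from Figure 1 can be sketched in a few lines: each composite activity is scored by a weighted sum of attribute classifier scores, where the weights stand in for association strengths estimated from script data. This is a simplified illustration under our own assumptions (the function names, the unit weights, and the score values are hypothetical); it is not the exact formulation developed later in the article.

```python
def score_composites(attribute_scores, associations):
    """Score each composite as a weighted sum of attribute scores.

    attribute_scores -- dict: attribute name -> classifier score for one video
    associations     -- dict: composite name -> {attribute: weight}; the
                        weights are assumed to be derived from script data
    """
    scores = {}
    for composite, weights in associations.items():
        scores[composite] = sum(
            w * attribute_scores.get(attr, 0.0) for attr, w in weights.items()
        )
    return scores


def recognize(attribute_scores, associations):
    """Return the composite activity with the highest score."""
    scores = score_composites(attribute_scores, associations)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical attribute classifier outputs for one test video.
    video = {"stir": 0.9, "spatula": 0.8, "open": 0.7, "egg": 0.9,
             "peel": 0.1, "onion": 0.05}
    # Hypothetical script-derived associations; "preparing scrambled egg"
    # needs no video training data, only its attribute weights.
    assoc = {
        "preparing scrambled egg": {"stir": 1.0, "spatula": 1.0,
                                    "open": 1.0, "egg": 1.0},
        "preparing onion": {"stir": 1.0, "spatula": 1.0,
                            "peel": 1.0, "onion": 1.0},
        "separating egg": {"open": 1.0, "egg": 1.0},
    }
    print(recognize(video, assoc))
```

Because the attribute classifiers are trained on the known composites and only the weights depend on the target class, a new composite can be added by supplying its script-derived weights alone.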


Our main contributions are as follows. First, we propose several hand- and
pose-based activity recognition approaches to recognize fine-grained
activities and their object participants. We benchmark them together with
state-of-the-art activity recognition features on our dataset. Second, we
contribute an attribute-based approach which shares knowledge across composite
activities and exploits textual script data to handle their large variability
and allows transfer to unseen composite activities. Third, we recorded and
annotated a video dataset called MPII Cooking 2. It provides challenges for
classification and detection of fine-grained activities and their
participants, human pose estimation, and composite activity recognition
(optionally) using script data. In addition to activity recognition, which is
the focus of this work, the dataset is also being used for 3D human pose
estimation (Amin et al. 2015), multi-frame pose estimation (Cherian et al.
2014), discovering object categories from activities (Srikantha and Gall
2014), grounding semantic similarities of natural language sentences in video
(Regneri et al. 2013), and for generating natural language descriptions
(Rohrbach


The remaining article is structured as follows. We first make an extensive
review of related datasets, activity recognition approaches, and the use of
text data for visual recognition in Section 2. Then we introduce our MPII
Cooking 2 dataset in Section 3, which we benchmark in the subsequent sections.
In Section 4 we make a quantitative comparison of our pose estimation and hand
detection with related work on the pose challenge of our dataset. Using the
pose estimation and hand detections we define several visual features and
discuss fine-grained activity detection in Section 5. In Section 6 we present
our approach to combine the fine-grained activities to composite activities
and integrate script data. In Section 7 we evaluate fine-grained and composite
activity recognition and then we conclude with the most important findings and
directions for future work in Section 8.


2 Related work 


We first present an overview of the different video activity recognition
datasets (Section 2.1) and then review recent approaches to activity
recognition (Section 2.2), putting a focus on works which use human pose as a
cue. Next we discuss works which use textual information for improved
recognition of activities (Section 2.3). We conclude by relating them to our
work (Section 2.4).


2.1 Activity Datasets 

Dataset                                        cls, det  classes  clips/videos  subjects  # frames    resolution

Full body pose datasets
KTH (Schuldt et al. 2004)                      cls       6        2,391         25        ~200,000    160x120
USC gestures (Natarajan and Nevatia 2008)      cls       6        400           4                     740x480
MSR action (Yuan et al. 2009)                  cls, det  3        63            10                    320x240

Movie and web video datasets
Hollywood2 (Marszalek et al. 2009)             cls       12       1,707/69
UCF 101 (Soomro et al. 2012)                   cls       101      13,320                  ~2,400,000  320x240
Sports-1M (Karpathy et al. 2014)               cls       487      1.1 mil
HMDB51 (Kuehne et al. 2011)                    cls       51       6,766                               height: 240
ASLAN (Kliper-Gross et al. 2012)               cls       432      3,631/1,571
Coffee and Cigarettes (Laptev and Perez 2007)  det       2        264/11
High Five (Patron-Perez et al. 2010)           cls, det  4        300/23
MPII Movie Description (Rohrbach et al. 2015)  cls, det           68,327/94                           1920x1080

Surveillance datasets
PETS 2007 (Ferryman 2007)                      det       3        10                      32,107      768x576
UT interaction (Ryoo and Aggarwal 2009)        cls, det  6        120           6
VIRAT (Oh et al. 2011)                         det       23       17                                  1920x1080

Assisted daily living datasets
TUM Kitchen (Tenorth et al. 2009)              det       10       20/4                    36,666      384x288
CMU-MMAC (la Torre et al. 2009)                cls, det  >130     26                                  1024x768
URADL (Messing et al. 2009)                    cls       17       150/30        5         < 50,000    1280x720
MPII Cooking 2 (our dataset)                   cls, det  67/59    14,105/273    30        2,881,616   1624x1224

Table 1 Overview of activity recognition datasets: We list if datasets allow
for classification (cls), detection (det); number of activity classes; number
of clips extracted from full videos (only one listed if identical), number of
subjects, total number of frames, and resolution of videos. We leave fields
blank if unknown or not applicable.

Even when excluding single image action datasets such as the Stanford-40
Action Dataset (Yao et al. 2011b)
or the Pascal Action Classification Challenge (Everingham et al. 2011), the
number of proposed activity datasets is quite large (Chaquet et al. (2013)
survey 68 datasets). Here, we focus on the most important ones with respect to
database size, usage, and similarity to our proposed dataset (see Table 1). We
distinguish four broad categories of datasets: full body pose, movie and web,
surveillance, and assisted daily living datasets - our dataset falls in the
last category.

The full body pose datasets are defined by actors performing full body
actions. KTH (Schuldt et al. 2004), USC gestures (Natarajan and Nevatia 2008),
and similar datasets (Singh and Nevatia 2011) require classifying simple full
body and mainly repetitive activities. The MSR actions (Yuan et al. 2009) pose
a detection challenge limited to three classes. In contrast to these full body
pose datasets, our dataset contains more and in particular fine-grained
activities.

The second category consists of movie clips or web videos with challenges such
as partial occlusions, camera motion, and diverse subjects. UCF50 (see
footnote 1) and similar datasets (Liu et al. 2009; Niebles et al. 2010;
Rodriguez et al. 2008) focus on sport activities. Kuehne et al.'s evaluation
suggests that these activities can already be discriminated by static joint
locations alone (Kuehne et al. 2011). UCF50 has been extended to UCF 101
(Soomro et al. 2012), significantly increasing the number of categories to 101
and including 2.4 million frames at a rather low resolution of 320x240. The
Sports-1M dataset exceeds all datasets with respect to number of clips (1.1
million) and categories (487 different sports), which are, however, only
weakly labeled. Hollywood2 (Marszalek et al. 2009), HMDB51 (Kuehne et al.
2011), and ASLAN (Kliper-Gross et al. 2012) have very diverse activities.
Especially HMDB51 (Kuehne et al. 2011) is an effort to provide a large scale
database of 51 activities while reducing the database bias. Although it
includes similar, fine-grained activities, such as shoot bow and shoot gun or
smile and laugh, most classes have a large inter-class variability and the
videos are low-resolution. ASLAN (Kliper-Gross et al. 2012) focuses on a
larger number of activities but with little training data per category. The
task is to identify similar videos rather than categorising them. A
significantly larger video collection is evaluated during the TRECVID
challenge (Over et al. 2012). The 2012 challenge consisted of 291h of short
videos from the Internet Archive (archive.org) and more than 4,000h of
multi-media (audio and video) data. The challenge covers different tasks
including semantic indexing and multi-media event recognition of 20 different
event categories such as making a sandwich and renovating a home. Large parts
of the data are, however, only available to the participants during the
challenge. Although our dataset is easier with respect to camera motion and
background, it is challenging with respect to a smaller inter-class
variability.

1 http://vision.eecs.ucf.edu/data.html


The datasets Coffee and Cigarettes (Laptev and Perez 2007) and High Five
(Patron-Perez et al. 2010) are different to the other movie datasets by
promoting activity detection rather than classification. This is clearly a
more challenging problem as one not only has to classify a pre-segmented video
but also to detect (or localize) an activity in a continuous video. As these
datasets have a maximum of four classes, our dataset goes beyond these by
distinguishing a large number of classes. The recent MPII Movie Description
dataset (Rohrbach et al. 2015) does not label clips with labels but with
natural sentences which are sourced from movie scripts and audio descriptions
for the blind.

The third category of datasets is targeted towards surveillance. The PETS
(Ferryman 2007) or SDHA2010 (see footnote 2) workshop datasets contain real
world situations from surveillance cameras in shops, subway stations, or
airports. They are challenging as they contain multiple people with high
partial occlusion. The UT interaction dataset (Ryoo and Aggarwal 2009)
requires distinguishing 6 different two-people interaction activities, such as
punch or shake hands. The VIRAT (Oh et al. 2011) dataset is a recent attempt
to provide a large scale dataset with 23 activities on nearly 30 hours of
video. Although the video is high-resolution, people are only 20 to 180 pixels
tall. Overall the surveillance activities are very different to ours which are
challenging with respect to fine-grained hand motion.

Next we discuss the domain of assisted daily living (ADL) datasets, which also
includes our dataset. The University of Rochester Activities of Daily Living
Dataset (URADL) (Messing et al. 2009) provides high-resolution videos of 10
different activities such as answer phone, chop banana, or peel banana.
Although some activities are very similar, the videos are produced with a
clear script and contain only one activity each. In the TUM Kitchen dataset
(Tenorth et al. 2009) all subjects perform the same composite activity
(setting a table) and rather similar actions with limited variation. Roggen et
al. (2010) and la Torre et al. (2009) present recent attempts to provide
several hours of multi-modal sensor data (e.g. body worn acceleration and
object location). But unfortunately people and objects are (visually)
instrumented, making the videos visually unrealistic. In the CMU-MMAC dataset
(la Torre et al. 2009) all subjects prepare the identical five dishes with
very similar ingredients and tools. In contrast to this our dataset contains
59 diverse dishes, where each subject uses different ingredients and tools in
each dish. The authors also record an egocentric view. Similarly, in (Farhadi
et al. 2010; Fathi et al. 2011; Stein and McKenna 2013) the camera view mainly
shows hands and manipulated cooking ingredients. Also recorded in an
egocentric view, Pirsiavash and Ramanan (2012) propose a dataset of 18 diverse
daily living activities, not restricted to the cooking domain, recorded in
different houses in non-scripted fashion.

Overall our dataset fills the gap of a large database with on the one hand a
detection challenge of fine-grained activities and on the other hand a
recognition challenge of highly variable composite activities.

2 http://cvrc.ece.utexas.edu/SDHA2010/

2.2 Advances in activity recognition 

Activity recognition for still images has been advanced e.g. by jointly
modeling people and objects (Yao and Li 2012) or scenes and objects (Li and Li
2007). In the following we focus on recognizing activities in video,
distinguishing three aspects: holistic features for activity recognition,
exploiting body pose, and modelling the temporal structure of activities.

To create a discriminative feature representation of a video, many approaches
first detect space-time interest points (Chakraborty et al. 2011; Laptev 2005)
or sample them densely (Wang et al. 2009a) and then extract diverse
descriptors in the image-time volume, such as histograms of oriented gradients
(HOG) and histograms of oriented flow (HOF) (Laptev et al. 2008) or local
trinary patterns (Yeffet and Wolf 2009). Messing et al. (2009) found improved
performance by tracking Harris3D interest points (Laptev 2005). The
state-of-the-art Dense Trajectories approach from Wang et al. (2013a) uses
this idea: it tracks dense feature points and extracts strong video features
around these tracks, namely HOG, HOF, and Motion Boundary Histograms (MBH,
Dalal et al. 2006). They report state-of-the-art results on several datasets
including KTH (Schuldt et al. 2004), UCF YouTube (Liu et al. 2009), Hollywood2
(Marszalek et al. 2009), and HMDB51 (Kuehne et al. 2011). Recently, Wang and
Schmid (2013) improved their approach by removing background flow and by
ensuring that detected humans do not contribute to the background motion
estimation. Additionally they replace the BoW encoding with Fisher vectors.
The computational effort of this approach can be significantly reduced by
replacing dense flow with motion information from video compression (Kantorov
and Laptev 2014).
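
To make the bag-of-words (BoW) encoding mentioned above concrete, the following minimal sketch quantizes local descriptors against a codebook and returns a normalized histogram as the video representation. The toy two-dimensional descriptors and the two-word codebook are assumptions for illustration; real descriptors such as HOG, HOF, or MBH are much higher dimensional, and the codebook is typically obtained by clustering training descriptors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors, codebook):
    """Encode local descriptors as an L1-normalized codeword histogram.

    descriptors -- list of feature vectors extracted from one video
    codebook    -- list of codeword vectors (assumed to be given)
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Hard assignment: each descriptor votes for its nearest codeword.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(d, codebook[k]),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.2]]
    print(bow_histogram(descriptors, codebook))
```

The resulting fixed-length histogram can be fed to a standard classifier regardless of how many local features a video contains; Fisher vectors replace the hard counts with richer first- and second-order statistics.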


As an alternative to manually defined activity features, Taylor et al. (2010),
Baccouche et al. (2011), Le et al. (2011), and Ji et al. (2013) use deep
learning with convolutional neural networks to learn an activity feature
representation. So far these approaches cannot reach the performance of the
manually defined Dense Trajectories even when learning on a database of over 1
million videos (Karpathy et al. 2014).


Human body poses and their motion frequently characterize human activities and
interactions. This has been exploited in Microsoft’s Kinect, which uses human
pose as a game controller but relies on a depth sensor to recognize human pose
(Shotton et al. 2011). Earlier work in human pose based activity recognition
employed motion capture systems using physical on-body markers to reliably
capture human poses, e.g. (Campbell and Bobick 1995). Such an approach is
impractical for recording realistic data. Recently a number of hand and
pose-centric approaches have been proposed for activity recognition for more
realistic video recordings (Fathi et al. 2011; Packer et al. 2012; Yao et al.
2011a; Sung et al. 2011; Raptis and Sigal 2013; Jhuang et al. 2013) as well as
in static images (Yang et al. 2011; Yao and Li 2012). Packer et al.
demonstrate impressive results in recognition of kitchen activities using body
poses recovered from depth images. Fathi et al. (2011) propose a hand-centric
approach for learning effective models of activities from egocentric video by
observing regularities in hand-object interactions. Hand poses have been shown
to facilitate extraction of appearance features for activity recognition in
static images (Karlinsky et al. 2010). Pose-based models are effective for
activity recognition when body poses can be estimated reliably, as e.g. in
depth images (Packer et al. 2012; Sung et al. 2011). Mittal et al. (2011) and
Gkioxari et al. (2013) aim for specialized representations for hands, but do
not apply them to pose estimation or activity recognition. Jhuang et al.
(2013) study the benefits of pose estimation for activity recognition on a
subset of the HMDB dataset (Kuehne et al. 2011). They show that ground truth
pose, estimated over time, can significantly outperform the holistic Dense
Trajectories features (Wang et al. 2013a); this is also true for estimated
pose using (Yang and Ramanan 2013) but only on a subset where the full body is
visible.

Although several interesting techniques have been proposed to model the
temporal structure of videos, they typically perform only below or on par with
bag-of-words based approaches: A simple temporal structure is encoded in the
template-based Action MACH from Rodriguez et al. (2008), Brendel and Todorovic
(2011) model temporal and spatial structure by segmenting the spatio-temporal
volume, and Niebles et al. (2010) model activities as a temporal composition
of primitive actions and discriminatively learn such models. While Niebles et
al. fix anchor points and the length of the temporal segments before training,
Tang et al. (2012) learn all parameters from data using a variable-duration
hidden Markov model. An AND/OR graph structure can be used to combine
different features at its nodes (Tang et al. 2013) or model co-occurring and
consecutive actions (Gupta et al. 2009). Recently Pirsiavash and Ramanan
(2014) have shown how to efficiently parse activity videos with segmental
grammars.

2.3 Natural language text for activity recognition 

Natural language descriptions have been shown to be beneficial for image
segmentation (Socher and Fei-Fei 2010) or recognizing object categories (Wang
et al. 2009b; Elhoseiny et al. 2013). Similar to our work, Elhoseiny et al.
use classifiers trained on the known classes. Representing the text
descriptions with tf*idf (term frequency times inverse document frequency)
vectors for relevant encyclopedic entries, they compare a regression, a domain
adaptation, and a newly proposed constrained optimization formulation to learn
a function from the textual vector to the visual classifier space. On two
fine-grained visual recognition datasets, CU200 Birds (Welinder et al. 2010)
and Oxford Flower-102 (Nilsback and Zisserman 2008), they show the benefit of
their constrained optimization approach. Semantic similarity from linguistic
resources has also been used to allow zero-shot recognition in images via
attributes and direct similarity (Rohrbach et al. 2010) and by learning an
embedding into a linguistic word vector space (Socher et al. 2013; Frome et
al. 2013). In addition to transferring knowledge one can exploit the unlabeled
instances to improve recognition, assuming a transductive setting. For this,
Fu et al. (2013) exploit the test-data distribution by performing a single
round of self-training by averaging over the k-nearest neighbors.


Teo et al. (2012) improve activity recognition by adding object detectors,
which are selected based on the linguistic co-occurrence statistics in the
newswire Gigaword Corpus. A similar idea is pursued by Motwani and Mooney
(2012), who mine and cluster verbs from descriptions of the video snippets in
the MSVD dataset (Chen and Dolan 2011). Zhang et al. (2011) show that tf*idf
can identify the most relevant terms in text descriptions collected for seven
video scenes, yielding close to perfect (98%) recognition accuracy on their
dataset. Ramanathan et al. (2013) jointly recognize actions and roles in
YouTube videos using their captions. They mine a large number of YouTube
descriptions and use a topic model to estimate the semantic relatedness
between an action/role and a description.

Another line of work focuses on describing videos 
with natural language descriptions. Recently |Guadar 
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rama et al. (2013) generated simple sentences for the 


Microsoft Video Description corpus (Chen and Dolan 


ents are relevant for a certain dish (composite activ¬ 
ity). For this we compare co-occurrence statistics with 


2011 

1 containing challenging web videos. 

Das et al. 

(2013 

) tf*idf, which has also been used by 

Zhang et al. 

(2011) 


compose descriptions for kitchen videos of their YouCook 
dataset showing YouTube cooking videos. Finally, we 
have shown how to learn a translation model for gener¬ 


ating natural sentences on our dataset (Rohrbach et al. 


2013b). 


2.4 Relations to our work 

Most of the activity recognition approaches and data¬ 
sets have been evaluated on full-body motion or chal¬ 
lenging web or movie datasets but not on fine-grained 
motions with low inter-class variability. We therefore 
evaluate the holistic Dense Trajectories approach from 


Wang et al. (2013a) as well as two pose-based and two 


hand centric approaches on our MPII Cooking 2 data¬ 
set. Our pose-based approach encodes trajectories of 
body joints using features motivated from the sensor- 


based activity recognition community (Zinnen et al. 


2009). The features are also similar to the relational 


and distance features defined on joints by |Jhuang et al.| 
Similarly to their work we define relational and dis¬ 
tance metrics between joints per frame and over time. 
However, our activities contain very subtle motions and 
the people have a very similar pose for most activities, 
which reduces the benefits of this feature representa¬ 
tion. jjhuang et al.| examine the advantages of focusing 
Dense Trajectories (Wan g et al.||20 13a) on body joints. 
In our static scene (holistic) Dense Trajectories are al¬ 
ready restricted to human body as the features are only 
extracted on moving points. However, in this work we 
propose to focus on hands, as they are the main cue for 
recognizing our fine-grained activities and participating 
objects. 


In (Amin et al. 2013) we improve the hand local¬ 


ization by leveraging multiple cameras to handle self¬ 
occlusion. In this work we remain monocular and pro¬ 
pose to use a specialized hand detector to improve pose 
estimation and activity recognition. 

To improve the recognition of fine-grained activities and
their participating objects we train a classifier on stacked
classifier scores from co-occurring activities/objects as well
as from temporal context after max pooling. Classifier
stacking has previously been explored e.g. in Ting and Witten
(1997), Liu et al. (2012), and Sill et al. (2009). Most
relevant to our work, Liu et al. (2012) try to optimize the
usage of training data and avoid over-fitting when learning
stacked video classifiers. This could be beneficial when
applied to our approach.

In this work we exploit cooking instructions (script 
data) to extract which activities, tools, and ingredi- 


and Elhoseiny et al. (2013) to extract relevant concepts
for video scene and object recognition. We find that
tf*idf better discriminates different dishes and improves
performance in most cases. Script data allows for zero-shot
recognition, which has mainly been used for object
recognition, but also for multi-media data by Fu et al.
(2013). Fu et al. learn a latent attribute representation
on the known classes, but then use manually defined
attribute associations to transfer.

While the temporal structure, i.e. temporal ordering, seems
an important component to recognize activities, so far mainly
the short term structure of short video clips has been
explored (e.g. Gupta et al. 2009; Brendel and Todorovic 2011;
Tang et al. 2012). In this work we exploit temporal
co-occurrence within the same time interval and context of
short actions and their participating objects within the
entire video using max pooling. For long term composite
activities we aggregate their components with max pooling,
ignoring the temporal order. Nevertheless, we believe that the
temporal structure of scripts (Regneri et al. 2010) might form
a good prior for the temporal structure of videos and vice
versa. Bojanowski et al. (2014) have recently shown the
benefit of movie scripts as weak supervision. They use
the ordering constraints provided by the script data to
localize the actions and to learn action models.
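
The order-agnostic aggregation described above can be illustrated with a minimal sketch. It assumes that every time interval of a video already has one classifier score per attribute; the function name and the example scores are hypothetical and are not taken from the experiments.

```python
import numpy as np

def max_pool_scores(interval_scores):
    """Max-pool classifier scores over all time intervals of one video.

    interval_scores: array of shape (num_intervals, num_attributes).
    Returns one score per attribute; the temporal order is discarded.
    """
    scores = np.asarray(interval_scores, dtype=float)
    return scores.max(axis=0)

# Hypothetical scores for 3 intervals and 4 attributes
# (e.g. cut, peel, knife, cucumber).
video = np.array([[0.1, 0.9, 0.2, 0.4],
                  [0.8, 0.1, 0.7, 0.5],
                  [0.3, 0.2, 0.1, 0.6]])
pooled = max_pool_scores(video)

# Reordering the intervals leaves the pooled representation unchanged.
shuffled = max_pool_scores(video[[2, 0, 1]])
```

Because only the maximum per attribute is kept, two videos that contain the same steps in a different order receive the same representation, which is the property the paragraph above relies on.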

Finally we briefly summarize how this work extends our
original publications (Rohrbach et al. 2012a) and
(Rohrbach et al. 2012b). First, we updated the dataset
by correcting and unifying some of the annotations and
adding a few more videos. We refer to this new version
as MPII Cooking 2. It supersedes both previous datasets,
see Table 3. Second, we present hand-centric approaches
for fine-grained recognition, namely an integration of
pose estimation and a hand detector, and hand-centric
features for activity recognition (arXiv: Senina et al.
2014). Third, we integrated our Propagated Semantic
Transfer (PST) from Rohrbach et al. (2013b) for composite
recognition. Fourth, we extended the qualitative and
quantitative results. Fifth, we extended the discussion of
related work. Sixth, we reran the experiments with the
updated version of Dense Trajectories (Wang and Schmid
2013). And last, we will release the updated version of
the dataset, new intermediate features, as well
as the script data.

3 Dataset “MPII Cooking 2” 

For our dataset we video-recorded human subjects cooking
a diverse set of dishes, e.g. making pizza or preparing
cucumber. The dishes form the composite activities and
the individual steps taken are the fine-grained activities,
e.g. cut, pour, or spice. All videos have a composite label
and are annotated with time intervals. Each time interval
has a fine-grained activity and the participating objects
as labels. A subset of frames was annotated with human pose
and hands. In the following we provide details and
statistics of the dataset; Figures 1 and 2 show example
frames of the dataset.

Figure 2 Single frames from the dataset depicting
fine-grained cooking activities and diverse sets of tools and
ingredients (participants). (a) Full scene of slicing in the
composite activity omelet, and crops of (b) take out, (c) dicing,
(d) take out, (e) squeeze, (f) peel, (g) wash, (h) grate.


MPII Cooking:     sandwich, salad, fried potatoes, potato pancake,
                  omelet, soup, pizza, casserole, mashed potato,
                  snack plate, cake, fruit salad, cold drink, and
                  hot drink

MPII Composites:  cooking pasta, juicing {lime, orange}, making
                  {coffee, hot dog, tea}, pouring beer, preparing
                  {asparagus, avocado, broad beans, broccoli and
                  cauliflower, broccoli, carrots and potatoes,
                  carrots, cauliflower, chilli, cucumber, figs,
                  garlic, ginger, herbs, kiwi, leeks, mango, onion,
                  orange, peach, peas, pepper, pineapple, plum,
                  pomegranate, potatoes, scrambled eggs, spinach,
                  spinach and leeks}, separating egg, sharpening
                  knives, slicing loaf of bread, using {microplane
                  grater, pestle and mortar, speed peeler, toaster,
                  tongs}, zesting lemon

Table 2 Composite activities (dishes) of the MPII Cooking 2
dataset. Composites marked in bold are part of the test split.

For this work we corrected and unified some of the 
annotations and added a few more videos. We refer 
to this new dataset version as MPII Cooking 2. It supersedes
both previous datasets. Table 3 compares the
different versions and shows different statistics about 
them. The table also shows the proposed training/vali¬ 
dation/test split, which is selected in a way that for 
all 31 composite activities in the test set, there are at 
least 3 training/validation videos and there is no over¬ 
lap between training, validation, and test subjects. In 
contrast to the earlier versions we avoid multiple test 
splits for simpler evaluation and to reduce the compu¬ 
tational burden for other researchers evaluating on the 
dataset. 


3.1 Dataset statistics and versions


We recorded 30 subjects in 273 videos with a total
length of more than 27 hours or 2,881,616 frames. Each
video contains a single subject preparing a certain dish.

The dataset was recorded in two batches. The first
part contains few, but very diverse and complex dishes
(see upper part of Table 2) and was presented in
(Rohrbach et al. 2012a). The second part, presented in
(Rohrbach et al. 2012b), focuses on composite activities and
thus contains significantly more dishes/composites which
are slightly shorter and simpler, see lower part of Table 2.
The second set of composite activities was selected
according to our script corpus which we describe below
in Section 3.4. We ignored some of them which were either
too elementary to form a composite activity (e.g.
how to secure a chopping board), were duplicates with
slightly different titles, or were impractical because of
limited availability of the ingredients (e.g. butternut
squash).


3.2 Dataset recording and annotation protocol

To record realistic behavior we neither asked subjects to
perform certain activities nor to follow a certain recipe;
we told them only which dish they should prepare.
This resulted in a larger variety in how subjects prepared
things: subjects used different tools for preparation
(knife or peeler for peeling), took different steps (e.g.
some people cooked the vegetables, some did not), and did
things in different temporal orders for the same dish (e.g.
washed the vegetable before or after they peeled it). Before
the recording the subjects were shown our kitchen and the
locations of tools and ingredients so that they would feel at
home. During the recording subjects could ask questions in
case of problems and some listened to music. We always
started the recording with an empty and clean kitchen, prior
to the subject entering the kitchen, and ended it once the
subject declared to be finished,
i.e. we did not include the final cleaning process. Most
subjects were university students from different disciplines
recruited by e-mail and publicly posted flyers.

                                                               categories           ground truth  attribute  video
Dataset                                 videos  subjects  composites  attributes  time intervals  instances  duration
MPII Cooking (Rohrbach et al. 2012a)        44        12          14         218           3,824     15,382  3-41 min
MPII Composites (Rohrbach et al. 2012b)    212        22          41         218           8,818     33,876  1-23 min
combined                                   256        30          55         218          12,642     49,258  1-41 min
MPII Cooking 2                             273        30          59         222          14,105     54,774  1-41 min
- Training set                             201        24          58         222          10,931     42,619  1-41 min
- Validation set                            17         1          17         107             445      1,662  1-8 min
- Test set                                  42         5          31         169           2,102      8,023  1-13 min

Table 3 Dataset statistics. Note that the train/val/test splits do not add up to the full dataset, as some videos of the test
subjects are not used because they have fewer than three train/val videos.


1. get a large sharp knife
2. get a cutting board
3. put the cucumber on the board
4. hold the cucumber in your weak hand
5. chop it into slices with your strong hand

1. gather your cutting board and knife.
2. wash the cucumber.
3. place the cucumber flat on the cutting board.
4. slice the cucumber horizontally into round slices.

1. wash the cucumber
2. peel the cucumber
3. place cucumber on a cutting board.
4. take a knife and rock it back and forth on the cucumber
5. make a clean thin slice each time.

Table 4 Three example scripts for composite activity preparing cucumber.


Subjects were paid per hour and cooking experience
ranged from beginner cooks to amateur chefs.


Composite activities are annotated on the level of
each video. Fine-grained activities were annotated with
start and end frame in a two-stage revision phase using
the annotation tool Advene (Aubert and Prié 2007).
In addition to the activity category each annotation
consists of used tools, ingredients, and locations (we refer
to them as participants). Composite activities were
chosen as described in Sections 3.1 and 3.4. Activity,
tool, ingredient, and location categories were chosen to
describe all activities the human subjects were performing.
The decision was made after the recording on the
basis of what the human subjects did. With respect to the
level of detail, we do not annotate the specific motions
(e.g. move arm up or down) but what effect or semantics
they have (e.g. open versus close). See Table 1 for the
chosen granularity.


We recorded in our kitchen (see Figure 2(a)) with a
4D View Solutions system using a Point Grey Grasshopper
camera with 1624x1224 pixel resolution at 29.4fps
and global shutter. The camera is attached to the ceiling,
recording a person working at the counter from the
front. We provide the sequences as single frames (jpg
with compression set to 75) and as video streams
(compressed weakly with mpeg4v2 at a bit-rate of 2500). For
most videos we recorded 7 additional camera views on
the kitchen; a subset was used and released by Amin
et al. (2013). Although they are not used in this work
we will make the remaining 7 views available upon
publication. All fine-grained and composite activity
annotations are also valid for the other cameras as each frame
was synchronized across all 8 cameras.

We also provide intermediate representations of holis¬ 
tic video descriptors, human pose detections, tracks, 
and features defined on the body pose. We hope this 
will foster research at different levels of activity recog¬ 
nition. 

The dataset furthermore provides human body pose
annotations (see Section 3.3) and script data (see Section 3.4),
and there exist textual descriptions in the TACoS
(Regneri et al. 2013) and TACoS multi-level corpus
(Rohrbach et al. 2014). The descriptions in TACoS describe
what happens in a specific video and are temporally
aligned to the video, i.e. they provide a textual annotation.
In contrast, the scripts used in this work are collected
independently of the video and thus contain domain
or script knowledge, i.e. what activities and what
objects are likely used for a certain dish. As they are
not specific to the training videos they allow us to transfer
and generalize to novel test scenarios.


3.3 Pose Challenge 

A subset of frames has articulated human pose and
hand annotations to learn and evaluate pose estimation
approaches and hand detectors. For human pose
we annotated the frames with right and left shoulder,
elbow, wrist, and hand joints as well as head and torso.
We have 2,994 frames of 10 subjects for training pose
estimation and an additional 4,250 training images
with hand points used for training the hand detector.
For testing we sample 1,277 frames from all activities
with 7 subjects as the test set for the pose challenge. All
training and test frames are from MPII Cooking
(Rohrbach et al. 2012a) and thus avoid an overlap with the
test subjects and test composites in MPII Cooking 2.


3.4 Mining script data for composite activities 


Linguistics and psychology literature knows prototypical
sequences of certain activities as so-called scripts
(Barr and Feigenbaum 1981). Scripts describe a certain
scenario, which corresponds to a composite activity
in our case. Scenarios (e.g. eating in a restaurant)
consist of temporally ordered events (the patron enters
restaurant, he takes a seat, he reads the menu, ...) and
subjects (patron, waiter, food, menu, ...). Written event
sequences for a scenario can be collected on a large scale
using crowd-sourcing (Regneri et al. 2010). We make
use of this method to collect scripts for our composite
activities, assembling a large number of written
sequences for each of them.

We collect natural language sequences similar to
Regneri et al. (2010) using Amazon’s Mechanical Turk³.
For each composite activity, we asked the subjects to
give tutorial-like sequential instructions for executing
the respective kitchen task. The instructions had to
be divided into sequential steps with at most 15 steps
per sequence. We select 53 relevant kitchen tasks as
composite activities by mining the tutorials for basic
kitchen tasks on the webpage “Jamie’s Home Cooking
Skills”⁴. All those tasks/scenarios are about processing
ingredients or using certain kitchen tools. In addition
to the data we collected in this experiment, we
use data from the OMICS corpus (Singh et al. 2002)
and Regneri et al. (2010) for 6 kitchen-related composite
activities. This results in a corpus with 59 composite
activities and 2,124 sequences in sum, having a
total of 12,958 individual event descriptions. Note that
for practical reasons we only recorded videos for 35 of
these composite activities as discussed in Section 3.1.
They are listed in Table 2 under “MPII Composites”.

This script corpus provides much more variation
than the limited number of video training examples can
capture. Of course this also poses a challenge, because
we need to overcome the problem of different wordings
and coordinated events: Table 4 shows three examples
we collected for the composite activity preparing
cucumber. They differ in verbalization (e.g. slice, chop,
and make a slice) and granularity (getting something is
often left out). Further, the sequences reflect different
ways of preparing the vegetable: some include peeling it,
some do not wash it, and so on. Some sentences contain
coordinated events (take a knife and rock it ...). While
we clean the data to a certain degree by fixing spelling
mistakes and resolving pronouns with the method from
Bloem et al. (2012), we end up with both the challenges
and the blessings of a noisy but big script corpus.

In Section 6.4 we will describe how we extract
semantic relatedness from this data.


3 http://www.mturk.com

4 http://www.jamieshomecookingskills.com


4 Hand detection and pose estimation 


One goal of this paper is to investigate the applicability
of state-of-the-art pose estimation methods in the context
of activity recognition. Therefore, in this section
we propose our new pose estimation method based on
Andriluka et al. (2011) and benchmark it on our dataset
together with state-of-the-art pose estimation methods.
Another goal is to demonstrate the importance of
hand-based features for recognizing activities and their
participants. For this we need to localize hands, which
is in itself a challenging task due to partial occlusions,
obstruction by manipulated objects, and variability of
hand postures. In order to achieve high quality hand
localization we leverage two complementary sources of
information. We exploit the characteristic appearance
of hands in order to train an effective hand detector.
We then integrate observations from this detector in
our pose estimation approach to take advantage of the
context provided by the other body parts. As another
finding, we show that localization of all body parts
benefits significantly from our specialized hand detector.

In the following we introduce our hand detector
(Section 4.1) and pose estimation method (Section 4.2)
as well as how we combine them (Section 4.3). In
Section 4.4 we evaluate our proposed approaches as well as
state-of-the-art pose estimation methods on our dataset.


4.1 Hand detection based on local appearance 


As a basis for our hand detector we rely on the de¬ 
formable part models (DPM, Felzenszwalb et al. 2010). 
We discuss several design choices in order to achieve 
best performance. 


Detection of left and right hands. We aim for a hand
detector that can correctly distinguish the left and right
hand of a person. The rationale behind this is that for
many activities left and right hands have different roles
(e.g. for a cutting activity the dominant hand is typically
holding a knife while the supporting hand is holding
the object that is being cut). Further, we would like
to avoid situations when two strong hypotheses for one
of the hands are chosen over two hypotheses for both
[...] components to left and right hands and jointly training
them within the same detector (see examples in Figure 3).
Note that in contrast to the default setting mirroring
is switched off in DPM. At test time we pick the
best scoring hypothesis among the components
corresponding to left and right hands.

Figure 3 Examples of training images assigned to 4 different
hand components; each row shows images from one component.
Rows 1 and 2 correspond to right hand components,
and rows 3 and 4 to left hand components.

Component initialization. We capture the variance of
hand postures by decomposing the hands’ appearance
into multiple modes and representing each mode with a
specific DPM component. We found that a rather large
number of components is necessary to achieve good
detection performance. We initialize the components by
clustering the HOG descriptors of the training examples
using K-means as in Divvala et al. (2012). The detection
further improves by first clustering the training
examples by hand orientation and then by HOG.

Body context. We improve the hand localization by
augmenting the hand detector with the context provided
by a person detector. We rely on the person detector
to constrain the search for hands to the image locations
within the extended person bounding box and also
constrain the scale of the hands detector to the scale of the
person hypothesis.


4.2 Pose estimation


We base our pose estimation approach on the pictorial
structures (PS) approach (Fischler and Elschlager
1973; Felzenszwalb and Huttenlocher 2005). In PS the
body is represented as a collection of rigid parts linked
via a set of pairwise part relationships. Unlike the original
model we define a flexible variant of the PS model
(FPS) that consists of N = 10 parts corresponding to
head, torso, as well as left and right shoulders, elbows,
wrists and hands. Denoting the configuration of parts
as L = {l_1, ..., l_N} and image observations as D, the
posterior over the part configuration is given by

p(L \mid D) \propto \prod_{(i,j) \in E} p(l_i \mid l_j) \prod_{i=1}^{N} p(D \mid l_i),    (1)

where E is a set of connected part pairs. We build on the
publicly available PS implementation from Andriluka
et al. (2011). In this model the pairwise connections
between parts form a tree structure, which permits efficient
and exact inference. The pairwise terms represent
the spatial relationships between part positions and are
modeled as Gaussians with respect to relative position
and orientation of parts. The appearance of individual
parts is represented with boosted part detectors
and shape context image features. Conceptually the
formulation of Andriluka et al. (2011) is similar to the
flexible mixture of parts model (FMP, Yang and Ramanan
2011). The FMP model represents the appearance of each
body part with a set of HOG templates. Pairwise terms
are adapted depending on the particular template.
Parameters of appearance templates and pairwise terms of
the FMP model are jointly trained using a max-margin
objective. The model of Andriluka et al. (2011) relies on
a single appearance template for all parts. Parameters
of pairwise terms are estimated using maximum likelihood
independently from appearance terms. We extend
this model by incorporating color features into the part
likelihoods by stacking them with shape context features
prior to part detector training. We encode the
color as a multidimensional histogram in RGB space
using 10 bins for each color dimension which results in
1000 dimensional feature vectors. We then concatenate
color and shape context features and train boosted part
detectors for each part using the combined representation.
We use standard AdaBoost for training and rely
on the same weak learners as in Andriluka et al. (2011).
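
The color encoding described above (10 bins per RGB channel, 1000 dimensions, stacked with shape context) can be sketched as follows. This is a minimal illustration and not the original implementation: the function names, the L1 normalization, and the placeholder length of the shape context vector are assumptions.

```python
import numpy as np

def rgb_histogram(patch, bins=10):
    """Joint RGB histogram of an image patch, flattened to bins**3 values.

    patch: uint8 array of shape (height, width, 3).
    """
    pixels = np.asarray(patch).reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    hist = hist.ravel()
    total = hist.sum()
    # L1 normalization is an assumption; the text does not specify one.
    return hist / total if total > 0 else hist

def combined_descriptor(shape_context, patch):
    """Concatenate a shape context vector with the color histogram."""
    shape_context = np.asarray(shape_context, dtype=float)
    return np.concatenate([shape_context, rgb_histogram(patch)])

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
color = rgb_histogram(patch)
# 60 is a placeholder length for the shape context part.
feature = combined_descriptor(np.zeros(60), patch)
```

With 10 bins per channel the joint histogram has 10 x 10 x 10 = 1000 entries, matching the dimensionality stated in the text.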


4.3 Combining hand detection and pose estimation 

We extend the image observations in Eq. 1 with detection
hypotheses for left and right hands, which we obtain
using the corresponding components of our hand
detector. We denote the set of hand hypotheses produced
by our hand detector by H = {(d_k, s_k) | k = 1, ..., K},
where d_k is the image position and s_k the
detection score.

(a)
                                                    upper arm     lower arm
Method                            Torso   Head      r      l      r      l    All
Original models
CPS (Sapp et al. 2010)             67.1    0.0   53.4   48.6   47.3   37.0   42.2
FMP (Yang and Ramanan 2011)        63.9   72.1   60.2   59.6   42.1   46.7   57.4
PS (Andriluka et al. 2009)         58.0   45.5   50.5   57.2   43.3   38.8   48.9
Trained on our data
FMP (Yang and Ramanan 2011)        79.6   67.7   60.7   60.8   50.1   50.3   61.5
PS (Andriluka et al. 2009)         80.1   80.0   67.8   69.6   48.9   49.6   66.0
FPS                                78.5   79.4   61.9   64.1   62.4   61.0   67.9
FPS + data                         79.3   85.0   64.3   64.6   60.0   59.8   68.8
FPS + data + hand det              79.6   84.9   70.9   70.0   73.5   70.2   74.9
FPS + data + color                 80.7   85.8   69.1   67.4   69.3   65.5   73.0
FPS + data + hand det + color      81.3   86.1   72.4   71.3   74.4   70.3   75.9

(b) [Plots for R-Hand and L-Hand; horizontal axis: Threshold (pixels)]

Figure 4 (a) 2D upper body pose estimation results on the “Pose Challenge” of our dataset. The numbers correspond to
the “percentage of correct parts” (PCP). (b) Accuracy of different methods for detection of right and left hands for a varying
distance (in pixels) from the ground truth position.

Based on this sparse set of detections
we obtain a dense likelihood map for the hand part l_h
using a kernel density estimate:

p(H \mid l_h) = \sum_{k=1}^{K} w_k \exp(-\sigma^2 \lVert d_k - l_h \rVert^2),    (2)

where w_k = s_k - t_0 is a positive weight associated
with each hand hypothesis computed by shifting the
detection score by the minimal score value t_0. There
is no specific upper/lower bound for the scores s_k, but
since DPM relies on an SVM formulation the scores tend to
be centered around 0 with confident negative examples
having a score less than -1. In practice we set t_0 = -1
and ignore all detections with a score smaller than t_0.
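
The kernel density estimate of Eq. 2 can be sketched as follows. This is a minimal illustration rather than the original implementation: the function name, the dense evaluation on a pixel grid, and the default kernel parameter are assumptions; only the threshold of -1 on the scores is taken from the text.

```python
import numpy as np

def hand_likelihood_map(detections, scores, height, width, sigma=0.5, t0=-1.0):
    """Dense hand likelihood from sparse detections, following Eq. 2.

    detections: (K, 2) array of (x, y) image positions d_k.
    scores:     (K,) detection scores s_k.
    sigma:      kernel parameter; 0.5 is only an illustrative value.
    t0:         minimal score; -1 is the value stated in the text.
    """
    detections = np.asarray(detections, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores > t0                        # ignore detections below t0
    d = detections[keep]
    w = scores[keep] - t0                     # positive weights w_k = s_k - t0
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(float)   # candidate positions l_h
    sq_dist = ((grid[:, :, None, :] - d[None, None, :, :]) ** 2).sum(axis=-1)
    return (w * np.exp(-sigma ** 2 * sq_dist)).sum(axis=-1)

# Three hypothetical detections; the last one scores below t0 and is dropped.
likelihood = hand_likelihood_map([(10, 5), (30, 20), (3, 3)],
                                 [0.5, 1.5, -2.0], height=40, width=50)
peak = np.unravel_index(likelihood.argmax(), likelihood.shape)  # (row, column)
```

In this toy setting the map peaks at the position of the highest-scoring detection, while the rejected detection contributes nothing.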


4.4 Evaluation: pose estimation and hand detection 


We first evaluate the results on the upper-body pose
estimation task. In order to identify the best 2D pose
estimation approach we use our 2D body joint annotations
(see Section 3.3). For evaluating these methods
we adopt the PCP measure (percentage of correct
parts) proposed by Ferrari et al. (2008). The results
are shown in Figure 4(a). The first three lines compare
three state-of-the-art methods: the cascaded pictorial
structures (CPS, Sapp et al. 2010), the flexible mixture
of parts model (FMP, Yang and Ramanan 2011)
and the implementation of the pictorial structures model
(PS, Andriluka et al. 2011), using their published pose
models. Lines 4 and 5 show the models of Yang and
Ramanan and Andriluka et al. retrained on our data.
Overall the model of Andriluka et al. performs best,
achieving 66.0 PCP for all body parts. We attribute
the improvement of PS over FMP to the following. The
FMP model encodes different orientations of parts via
different appearance templates, whereas the PS model
uses a single template that is rotation invariant and is
evaluated at all orientations. The FMP model has a
larger number of parameters because appearance templates
are not shared across different part orientations.
A larger number of parameters means that it is easier to
overfit the FMP model than the PS model. This could
explain the performance differences after retraining on
our data. It could also be that the finer discretization of
body part orientations in the PS model compared to
the FMP model is important for good performance. As
described above we base our model (FPS) on PS, adding
a flexible part configuration to it.


The bottom part of Figure 4(a) shows that this
as well as our other improvements to the model (more
training data compared to Rohrbach et al. (2012a), color
features, and hand detections) each help to improve
performance. Overall, compared to PS, we achieve
an improvement from 66.0 to 75.9 PCP and most notably
an improvement from 48.9 to 74.4 and from 49.6
to 70.3 for lower arms, which are most important for
recognizing hand-centric activities. We would also like
to point out the benefit that hand detections bring to
pose estimation (compare line 7 vs 8 and 9 vs 10).
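
For reference, the PCP measure can be sketched as follows. The sketch assumes the commonly used strict variant, in which a part is correct when both predicted endpoints lie within half of the ground-truth part length of their annotations; the exact variant used for Figure 4(a) is not stated in this section, and the function name and toy sticks are hypothetical.

```python
import numpy as np

def pcp(pred, gt, alpha=0.5):
    """Percentage of correct parts over a set of stick annotations.

    pred, gt: arrays of shape (num_parts, 2, 2) holding the two (x, y)
              endpoints of every predicted and ground-truth part.
    A part counts as correct when both predicted endpoints lie within
    alpha * (ground-truth part length) of their annotated positions.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    length = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=-1)   # (num_parts,)
    error = np.linalg.norm(pred - gt, axis=-1)              # (num_parts, 2)
    correct = (error <= alpha * length[:, None]).all(axis=1)
    return 100.0 * correct.mean()

gt = np.array([[[0, 0], [0, 10]],       # e.g. an upper arm of length 10
               [[0, 10], [0, 20]]])     # e.g. a lower arm of length 10
pred = np.array([[[1, 0], [0, 12]],     # both endpoints within 5 pixels
                 [[0, 10], [8, 20]]])   # second endpoint is 8 pixels off
score = pcp(pred, gt)
```

In this toy example one of the two parts is localized correctly, so the score is 50 PCP.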


Next we discuss the hand detection results. Our final
hand detector handDPM is based on 32 components
with 16 components allocated to each of the hands. The
components are initialized by first grouping the training
examples of each hand into 4 discrete orientations, and
then clustering their HOG descriptors. In the experiments
on hand localization we use a metric that reflects
the localization accuracy and measures the percentage
of hand hypotheses within a given distance from the
ground truth. We visualize the results by plotting the
localization accuracy for a range of distances.
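
The localization metric described above can be sketched as follows, assuming one hand hypothesis and one annotated position per frame. The function name and the toy coordinates are hypothetical.

```python
import numpy as np

def localization_accuracy(pred, gt, thresholds):
    """Fraction of hand hypotheses within each distance of the ground truth.

    pred, gt:   (num_frames, 2) arrays of predicted and annotated (x, y).
    thresholds: iterable of distances in pixels.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return np.array([(dist <= t).mean() for t in thresholds])

gt = np.array([[100, 100], [200, 150], [50, 60], [10, 10]])
pred = np.array([[103, 104], [200, 170], [50, 61], [70, 90]])
curve = localization_accuracy(pred, gt, thresholds=[5, 10, 25, 50, 100])
```

Plotting `curve` against the thresholds gives one accuracy curve per method and hand, in the style of Figure 4(b).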

Figure 4(b) presents the evaluation of the localization
accuracy of both hands. We observe that our
hand detector (handDPM, red-dashed curve) alone already
significantly improves over the proposed FPS approach
(black-dotted-triangles). The performance further
improves when hand detection hypotheses are integrated
within the pose estimation model (blue-solid-stars).
However, the improvement is moderate, likely
because the pose estimation approach is not optimized
specifically for hand detection and has to compromise
between localization of hands and other body parts.
Some qualitative examples are shown in Figure 5.

Figure 5 Pose helps to resolve failure cases of hand
localization (upper row: handDPM, lower row: FPS+data+hand
det+color).



We also compare our hand detector to the state-of-the-art
hand detector of Mittal et al. (2011) using the code
made publicly available by the authors. We perform the
best-case evaluation and assign the hand hypothesis
returned by the approach to the closest left and right
hand in the ground truth, as the hand detector does
not differentiate between left and right hands. For a fair
comparison we also filter the hand detections of Mittal
et al. (2011) at irrelevant scales and image locations
using body context as explained before. Our detector
significantly improves over the hand detector of Mittal
et al. (2011), which in addition to hand appearance also
relies on color and context features, whereas our hand
detector uses hand regions only. Note that there are
significant differences between localization accuracy of left
and right hands. We attribute this to the fact that the
majority of people in our database are right handed.
Since people perform many activities with their dominant
hand, the pose of the right hand is more likely
to be constrained by various activities due to the use
of tools such as a knife or peeler. The left hand’s pose
is far less deterministic and the hand is often occluded
behind the counter or while holding various objects.


5 Approaches for fine-grained activity
recognition and detection


In this section we focus on fine-grained activity
recognition to approach the challenges typical e.g. for
assisted daily living. Along with the activities we want to
recognize their participating objects. To better understand
the state-of-the-art for this challenging task we
benchmark three types of approaches on our new dataset.
The first type (Section 5.1) uses features derived
from an upper body model, motivated by the intuition that
human body configurations and human body motion
should provide strong cues for activity recognition. For
body pose estimation we rely on our approach described
in Sections 4.2 and 4.3. The second type (Section 5.2)
are the state-of-the-art Dense Trajectories (Wang et al.
2013a), which have shown promising results on various
datasets. It is a holistic approach in the sense that it
extracts visual features on the entire frame. As the third
type (Section 5.3) we present our hand-centric visual
features, targeted at recognizing our hand-centric
activities and the participating objects which are typically
in the hand neighbourhood. For this we propose
a hand detector (Sections 4.1 and 4.3). Finally, we discuss
our approaches to activity classification and detection
at the end of this section.


5.1 Pose-based approach


Pose-based activity recognition approaches were shown
to be effective using inertial sensors (Zinnen et al. 2009).
Inspired by Zinnen et al. (2009) we build on a similar
feature set, computing it from the temporal sequence
of 2D body configurations.


We employ a person detector (Felzenszwalb et al.
2010) and estimate the pose of the person within the
detected region with a 50% border around it. This allows
us to reduce the complexity of the pose estimation and
simplifies the search to a single scale. To extract the
trajectories of body joints we rely on search space
reduction (Ferrari et al. 2008) and tracking. To that end
we first estimate poses over a sparse set of frames (every
10-th frame in our evaluation) and then track over
a fixed temporal neighborhood of 50 frames forward
and backward. For tracking we match SIFT features for
each joint separately across consecutive frames. To discard
outliers we find the largest group of features with
coherent motion and update the joint position based
on the motion of this group. This approach combines
the generic appearance model learned at training time
with the specific appearance (SIFT) features computed
at test time.
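
The outlier rejection step can be illustrated with a minimal sketch. The text does not specify how the largest group with coherent motion is found; the simple consensus rule below (counting displacements that agree within a tolerance) is an assumption, as are the function name, the tolerance, and the toy matches.

```python
import numpy as np

def update_joint(joint, prev_pts, next_pts, tol=2.0):
    """Move a joint by the dominant motion of its matched features.

    joint:              (x, y) position in the previous frame.
    prev_pts, next_pts: (M, 2) positions of matched features in two
                        consecutive frames.
    tol:                displacements that differ by at most this many
                        pixels are treated as coherent (illustrative).
    """
    joint = np.asarray(joint, dtype=float)
    flow = np.asarray(next_pts, dtype=float) - np.asarray(prev_pts, dtype=float)
    if len(flow) == 0:
        return joint                          # no matches: keep the position
    # For every displacement, find the displacements that agree with it.
    diff = np.linalg.norm(flow[:, None, :] - flow[None, :, :], axis=-1)
    agree = diff <= tol
    best = agree.sum(axis=1).argmax()
    group = flow[agree[best]]                 # largest coherent group
    return joint + group.mean(axis=0)

prev_pts = np.array([[10, 10], [12, 11], [11, 13], [9, 12], [10, 11]])
motion = np.array([[3, 1], [3, 1], [3, 1], [-20, 15], [3, 2]])
new_joint = update_joint((11, 11), prev_pts, prev_pts + motion)
```

In the toy example the fourth match moves in a very different direction and is excluded, so the joint follows the mean motion of the remaining four matches.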
Given the body joint trajectories we compute two
different feature representations. The first is a set of
manually defined statistics over the body model trajectories,
which we refer to as body model features (BM). The second is
Fourier transform features (FFT) from Zinnen et al.
(2009), which have been shown to be effective for recognizing
activities from body-worn sensors.


Body model features (BM). For the BM features we compute the velocity of
all joints (similar to gradient calculation in the image domain). We bin
it in an 8-bin histogram according to its direction, weighted by the
speed (in pixels/frame). This is similar to the approach by Messing et
al. (2009), which additionally bins the velocity's magnitude. We repeat
this for the acceleration of each joint. Additionally we compute
distances between the right and corresponding left joints as well as
between all 4 joints on each body half. Similar to the joint
trajectories (i.e. trajectories of x,y values) we build corresponding
"trajectories" of distance values by stacking the values over temporally
adjacent frames. For each distance trajectory we compute statistics
(mean, median, standard deviation, minimum, and maximum) as well as a
rate-of-change histogram, similar to velocity. Last, we compute the
angle trajectories at all inner joints (wrists, elbows, shoulders) and
use the statistics (mean etc.) of the angle and angle speed
trajectories. This totals 556 dimensions.
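The speed-weighted direction histogram described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the function name `direction_histogram`, the bin layout (bin 0 starts at angle 0), and the use of frame-to-frame differences as velocity are assumptions.

```python
import numpy as np

def direction_histogram(track, n_bins=8):
    """Speed-weighted histogram of motion directions for one joint track.

    track: array of shape (T, 2) with the (x, y) joint position per frame.
    Returns an n_bins vector; bin k covers angles [k, k+1) * 2*pi/n_bins.
    """
    track = np.asarray(track, dtype=float)
    velocity = np.diff(track, axis=0)              # pixels/frame, shape (T-1, 2)
    speed = np.linalg.norm(velocity, axis=1)       # weight of each vote
    angle = np.arctan2(velocity[:, 1], velocity[:, 0]) % (2 * np.pi)
    bins = np.floor(angle / (2 * np.pi / n_bins)).astype(int) % n_bins
    return np.bincount(bins, weights=speed, minlength=n_bins)
```

Applying the same function to the sequence of velocity vectors (`np.diff(track, axis=0)`) gives the corresponding histogram for acceleration.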


Fourier transform features (FFT). The FFT feature 
contains 4 exponential bands, 10 cepstral coefficients, 
and the spectral entropy and energy for each x and y 
coordinate trajectory of all joints, giving a total of 256 
dimensions. 


Feature representation. For both features (BM and FFT) we compute a
separate codebook for each distinct sub-feature (i.e. velocity,
acceleration, exponential bands etc.), which we found to be more robust
than a single codebook. We set the codebook size to twice the respective
feature dimension and create each codebook by running k-means on all
features (over 80,000). We compute both features for trajectories of
length 20, 50, and 100 (centered at the frame where the pose was
detected) to allow for different motion lengths. The resulting features
for the different trajectory lengths are combined by stacking, giving a
total feature dimension of 3,336 for BM and 1,536 for FFT.
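The stated totals are consistent with the following reading, which is our interpretation rather than something the text spells out: the codebooks of one feature together have twice as many words as the raw feature has dimensions, and one such histogram is built for each of the three trajectory lengths.

```latex
\begin{align*}
\text{BM:}\quad  & 3 \times (2 \times 556) = 3 \times 1112 = 3336, \\
\text{FFT:}\quad & 3 \times (2 \times 256) = 3 \times 512 = 1536.
\end{align*}
```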


5.2 Holistic approach 

Most approaches for activity recognition are based on a bag-of-words
representation. We pick the state-of-the-art Dense Trajectories approach
(Wang et al. 2011, 2013a), which extracts histograms of oriented
gradients (HOG), flow (HOF, Laptev et al. 2008), and motion boundary
histograms (MBH, Dalal et al. 2006) around densely sampled points, which
are tracked for 15 frames by median filtering in a dense optical flow
field. The x and y trajectory speed is used as a fourth feature. Using
their code and parameters, which showed state-of-the-art performance on
several datasets, we extract these features on our data. Following Wang
et al. (2013a) we generate a codebook of 4,000 words for each of the
four features using k-means on over a million sampled features.


5.3 Hand-centric approach 

In domains where people mainly perform hand-related activities it seems
intuitive to expect that hand regions contain important and relevant
information for recognizing those activities and the participating
objects. Thus, in addition to using the holistic and pose-based
features, we propose to focus on the hand regions. To obtain the hand
locations we rely on our hand detector described in Section [Tl] as well
as on the pose estimation method with integrated hand candidates
(Section 4.3). To increase the robustness of the method we use both
location candidates (provided by the hand-DPM detector and the final
pose model) and sum the obtained features.


Hand-Trajectories. We want to represent different types of information:
hand motion, hand shape, and shape variations over time, as well as the
appearance of objects manipulated by the hands. We propose to densely
sample the neighborhood of each hand and to track those points over
time. For tracking and for representing the point trajectories with
powerful features we adapt the approach of Wang et al. (2013a). We focus
only on densely sampled points around the estimated hand positions
instead of sampling the entire video frame. We specify a bounding box
around each hand detection and densely sample points inside it. In our
experiments we use a bounding box of 120x140 pixels around each hand to
include information about the hand's context. We use a grid spacing of
8 pixels for point sampling, which yields 136 interest point tracks for
each frame. After extracting the features along the computed tracks we
create codebooks that contain 4,000 words per feature.


Hand-cSift. Color information is another important cue for recognizing
activities and is even more prominent for recognizing the participating
objects. Similar to the previous approach we densely sample points in
the hands' neighborhood and extract color Sift features on 4 channels
(RGB+grey). We quantize them in a codebook of size 4,000.


5.4 Fine-grained activity classification and detection 

Activity classification. Given a long video we assume that it consists
of multiple time intervals. Each such interval $t$ depicts a single
fine-grained activity and its participating objects (e.g. dry, hands,
towel). In the following we refer to both activities and participants as
activity attributes $a_i$, $i \in \{1, \dots, n\}$, i.e. $a_i$ can be
any attribute including cut, knife, or cucumber. We train one-vs-all SVM
classifiers on the features described in the previous sections given the
ground truth intervals and labels. The classifiers provide us with
real-valued confidence score functions
$f_i^{base}: \mathbb{R}^N \to \mathbb{R}$ for attribute $a_i$ and
feature vectors of dimension $N$. Different features are combined by
concatenating, i.e. stacking, the corresponding feature vectors.

Activity detection. While we use ground truth intervals for training the
activity classifiers, we use a sliding window approach to find the
correct interval of a detection. To efficiently compute the features of
a sliding window we build an integral histogram over the histogram of
the codebook features. We apply non-maximum suppression over different
window lengths: we start with the maximum score and remove all
overlapping windows. In the detection experiments we use a minimum
window size of 30 frames with a step size of 6 frames; we increase
window and step size by a factor of $\sqrt{2}$ until we reach a window
size of 1800 frames (about 1 minute). Although this still does not cover
all possible frame configurations, we found it to be a good trade-off
between performance and computational cost.
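The three ingredients of the detection step (integral histogram, window schedule, non-maximum suppression) can be sketched as below. This is an illustrative sketch under stated assumptions, not the released code: all function names are hypothetical, window and step sizes are rounded to whole frames, and the schedule simply stops before exceeding 1800 frames because the text does not say whether the largest window is clipped to exactly that size.

```python
import numpy as np

def integral_histogram(frame_hists):
    """Cumulative sum over time with a leading zero row: shape (T+1, K)."""
    frame_hists = np.asarray(frame_hists, dtype=float)
    zero = np.zeros((1, frame_hists.shape[1]))
    return np.vstack([zero, np.cumsum(frame_hists, axis=0)])

def window_histogram(integral, start, end):
    """Codeword histogram of frames [start, end) from two table lookups."""
    return integral[end] - integral[start]

def window_schedule(min_size=30, min_step=6, max_size=1800):
    """(window, step) pairs growing by sqrt(2) while window <= max_size."""
    schedule = []
    size, step = float(min_size), float(min_step)
    while size <= max_size:
        schedule.append((int(round(size)), max(1, int(round(step)))))
        size *= np.sqrt(2)
        step *= np.sqrt(2)
    return schedule

def temporal_nms(windows, scores):
    """Greedy NMS: keep the best window, drop every window overlapping it."""
    order = np.argsort(scores)[::-1]          # indices by descending score
    kept = []
    for idx in order:
        start, end = windows[idx]
        # Keep only if disjoint from every window that was already kept.
        if all(end <= windows[k][0] or start >= windows[k][1] for k in kept):
            kept.append(int(idx))
    return kept
```

With the integral histogram, the bag-of-words vector of any candidate window costs one subtraction of two rows, independent of the window length, which is what makes the multi-scale search affordable.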


6 Modeling composite activities 

In the previous section we discussed how we recognize fine-grained
activities (such as peeling or washing) and their object participants
(such as grater, knife, or cucumber). Now we focus on exploiting the
temporal context and on recognizing different composite activities, e.g.
preparing a cucumber or cooking pasta.

For this, we first show how we exploit temporal context and
co-occurrence to improve the recognition of fine-grained activities and
their object participants (Section 6.1). Then, we model composite
activities as a flexible combination of attributes, where attributes
refer jointly to the fine-grained activities and their object
participants (Section 6.2). We then show how to use prior knowledge
(Section 6.3) to improve the recognition of composite activities,
overcoming the notorious lack of training data and handling the large
variability of composite activities. In Section 6.4 we discuss how to
mine the semantic relatedness from script data. Finally, in Section 6.5
we introduce an automatic approach to temporal video segmentation, which
removes the need to manually annotate the ground truth intervals in a
video.


6.1 Recognizing activity attributes using context and 
co-occurrence 

For a time interval $t$ we want to classify whether a particular
fine-grained activity and its participants are present. We refer to
activities and participants as activity attributes $a_i$. We distinguish
three types of attribute classifiers. The first type is given by the
classifiers introduced in the previous section, providing us with
confidence score functions $f_i^{base}: \mathbb{R}^N \to \mathbb{R}$ for
each attribute $a_i$. Let us denote the score of a given feature vector
$x_t$ at time interval $t$ as:

$$ s_{i,t} = f_i^{base}(x_t). \qquad (3) $$

Together these scores constitute a matrix $S$ of dimensions $n \times T$
(#attributes x #timestamps). Based on these scores, we define features
for context (in the same video sequence) as well as features for
co-occurrence of other attributes (in the same time interval $t$).

Contextual features formalize the intuition that adjacent time frames
have strongly related attributes: e.g. if a cucumber is peeled in one
time interval, then cutting the cucumber is probably also present in the
same video sequence. As visualized in Figure 6(a) we define a context
feature $g_t^{con}: \mathbb{R}^{n \times T} \to \mathbb{R}^n$ at time
$t$ by max pooling the scores of each attribute over all time intervals
except $t$:

$$ g_t^{con}(S) = \max_{u \in \{1, \dots, T\} \setminus t} s_u \qquad (4) $$

where max is an element-wise operator over all columns
$s_u \in \mathbb{R}^n$ of matrix $S$.

Similarly, activity attributes happening at the same time interval $t$
are related, e.g. if we peel something it is more likely to also observe
carrot or cucumber rather than cauliflower. We thus define the
co-occurrence as a feature
$g_i^{coocc}: \mathbb{R}^n \to \mathbb{R}^{n-1}$ by stacking all
attribute scores at time $t$ excluding $s_{i,t}$:

$$ g_i^{coocc}(s_t) = [s_{1,t}; \dots; s_{i-1,t}; s_{i+1,t}; \dots; s_{n,t}], \qquad (5) $$

where $s_t \in \mathbb{R}^n$ is a column of matrix $S$.


Figure 6 Our approach to recognition of attributes (a) and composite
activities (b). (a) Activity attribute recognition using contextual and
co-occurrence attribute vectors. (b) Composite activity classification
using max-pooled activity attributes.
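For concreteness, the context and co-occurrence features of Equations (4) and (5) can be written in a few lines of NumPy. This is a minimal sketch: the function names and the toy score matrix are illustrative assumptions, and indices are 0-based, unlike in the equations.

```python
import numpy as np

def context_feature(S, t):
    """Equation (4): element-wise max over all columns of S except column t."""
    others = np.delete(S, t, axis=1)       # shape (n, T-1)
    return others.max(axis=1)              # shape (n,)

def cooccurrence_feature(S, i, t):
    """Equation (5): column t of S with the entry of attribute i removed."""
    return np.delete(S[:, t], i)           # shape (n-1,)

# Toy score matrix: n = 3 attributes (rows), T = 4 time intervals (columns).
S = np.array([[0.1, 0.9, 0.2, 0.0],
              [0.5, 0.3, 0.8, 0.1],
              [0.0, 0.2, 0.1, 0.7]])
```

Note that the context feature depends only on the interval $t$, whereas the co-occurrence feature depends on both the interval and the attribute $i$ whose own score is left out.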


Based on these features we train activity attribute SVM classifiers
using the features individually or by stacking them. Specifically we
obtain corresponding confidence score functions for context,
$f_i^{con}: \mathbb{R}^n \to \mathbb{R}$, and co-occurrence,
$f_i^{coocc}: \mathbb{R}^{n-1} \to \mathbb{R}$, where $i$ denotes that a
separate function is trained for each attribute $a_i$. We define the
corresponding scores as:

$$ s_{i,t}^{con} = f_i^{con}(g_t^{con}(S)) \qquad (6) $$

and

$$ s_{i,t}^{coocc} = f_i^{coocc}(g_i^{coocc}(s_t)). \qquad (7) $$

This formulation can be easily extended to other attribute
representations depending on the task and available features.


6.2 Composite activity classification using activity attributes

We now want to classify composite activities that span an entire video
sequence, given attribute classifier scores. We note that we can use any
of the scores introduced in the previous section ($s_{i,t}^{con}$,
$s_{i,t}^{coocc}$, or their stacked combination). In the following, for
simplicity, we refer to these scores as $s_{i,t}$ and to the
corresponding matrix as $S$. In this approach we rely on a
representation that captures the likelihood of the presence or absence
of a particular attribute and leave modeling the temporal ordering of
attributes for future work. We define a feature for the video sequence
as $g^{seq}: \mathbb{R}^{n \times T} \to \mathbb{R}^n$ by max pooling
the scores of each attribute over all time intervals (see Figure 6(b)):

$$ g^{seq}(S) = \max_{t \in \{1, \dots, T\}} s_t \qquad (8) $$

where max is an element-wise operator over all columns
$s_t \in \mathbb{R}^n$ of matrix $S$.

To decide on the class $z$ of a sequence $d$ we use the feature
$g^{seq}$ and classify it using a nearest neighbor classifier (NN) or a
one-versus-all SVM given a set of labeled training sequences. The SVM
classifier provides us with the following confidence function for all
composite classes $z$: $f_z^{seq}: \mathbb{R}^n \to \mathbb{R}$, where
the final score is defined as:

$$ s_{z,d}^{seq} = f_z^{seq}(g^{seq}(S_d)), \qquad (9) $$

where $S_d$ is the score matrix for sequence $d$. The following sections
describe alternatives to NN and SVM to incorporate prior knowledge mined
from script data.


6.3 Script data for recognizing composite activities

Composite activities show a high diversity which is practically
impossible to capture in a training corpus. Our system thus needs to be
robust against many activity variants that are not present in the
training data. The use of attributes allows us to include external
knowledge to determine relevant attributes for a given composite
activity. For this we assume associations between attribute $a_i$ and
composite activity class $z$ in a matrix of weights $w_{z,i}$, with $Z$
being the number of composite activity classes. The vectors $w_z$ are
L1-normalized, i.e. $\sum_{i=1}^{n} w_{z,i} = 1$. Our system extracts
those associations from script data (see Section 6.4), but the approach
generalizes to other arbitrary external knowledge sources. We explore
three options to use such information, which we detail in the following.

Script data: We compute the confidence
$f_z^{scriptdata}: \mathbb{R}^n \to \mathbb{R}$ of a sequence being of
the composite activity $z$ using the attribute-based feature
representation $g^{seq}(S)$ introduced in Equation (8). Given the
weights $w_{z,i}$ we compute a weighted sum:

$$ f_z^{scriptdata}(g^{seq}(S)) = \sum_{i=1}^{n} w_{z,i} \, g_i^{seq}(S). \qquad (10) $$

For a specific sequence $d$ with corresponding score matrix $S_d$ we get
the following score:

$$ s_{z,d}^{scriptdata} = f_z^{scriptdata}(g^{seq}(S_d)). \qquad (11) $$
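Equations (8), (10), and (11) amount to a max-pooling step followed by a matrix-vector product. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the weights are taken as given and non-negative, and the L1 normalization is applied inside the function for convenience.

```python
import numpy as np

def sequence_feature(S):
    """Equation (8): max-pool each attribute score over all time intervals."""
    return np.asarray(S, dtype=float).max(axis=1)           # shape (n,)

def script_scores(S, W):
    """Equations (10)-(11): weighted sum of pooled attribute scores.

    S: (n, T) attribute score matrix of one sequence.
    W: (Z, n) non-negative association weights, one row per composite.
    Rows of W are L1-normalized here; all-zero rows are left at zero.
    """
    W = np.asarray(W, dtype=float)
    norm = W.sum(axis=1, keepdims=True)
    W = np.divide(W, norm, out=np.zeros_like(W), where=norm > 0)
    return W @ sequence_feature(S)                           # shape (Z,)
```

Because no classifier is trained on composite labels here, a composite class can be scored as soon as a weight row for it is available, which is what makes the zero-shot setting possible.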


This formulation is similar to the sum formulation we used in (Rohrbach
et al. 2011) for image recognition with attributes, which itself is an
adaptation of the direct attribute prediction model introduced by
Lampert et al. (2013). Note that the weight matrix retrieved from script
data is sparse (most $w_{z,i} = 0$). When mining from other corpora one
might need to keep only the weights $w_{z,i}$ above a threshold, setting
all others to zero, to achieve good performance, as done e.g. in
(Rohrbach et al. 2011).


NN+script data: When training data is available we can use a nearest
neighbor classifier. Often, only a handful of attributes are likely to
be indicative of a composite activity class, while the majority of other
attributes will provide irrelevant, potentially noisy information. When
searching for nearest neighbors such irrelevant attributes might
dominate the distance, resulting in suboptimal performance. To reduce
this effect we rely on the script data to constrain the attribute
feature vector to the relevant dimensions.

More specifically, we replace the L2 norm for computing the nearest
neighbor distance with the following training-class-dependent weighted
L2 norm, which takes the weights of the class-attribute associations
into account. It is defined between the test attribute vector of unseen
class $g^{seq}(S_{test})$ and the training attribute vector
$g^{seq}(S^z_{train})$ of class $z$ as:

$$ \mathrm{Dist}(S_{test}, S^z_{train}) = \left( \sum_{i=1}^{n} w_{z,i} \left( g_i^{seq}(S_{test}) - g_i^{seq}(S^z_{train}) \right)^2 \right)^{0.5} \qquad (12) $$

To enhance robustness further, we binarize all association weights
$w_{z,i}$ by setting all non-zero weights to 1 (and L1-normalize $w_z$).
This reduces the distance computation to the relevant attributes,
normalized by the total number of relevant attributes.


Propagated semantic transfer (PST): As the third approach to integrate
external knowledge from script data we use Propagated semantic transfer
(PST), which we proposed in (Rohrbach et al. 2013a) and summarize
briefly in the following. The approach builds on Equation (10) and uses
label propagation to exploit the distances within the unlabeled data,
i.e. it assumes a transductive setting where all test data is available
when predicting a single test label.

We can incorporate (partially) labeled training data
$l_{z,d} \in \{0, 1, \emptyset\}$ for class $z$ and sequence $d$, where
$\emptyset$ denotes that we do not have a label for this sequence and
class. We combine the labels with the predictions in the following way,
using only the most reliable predictions $s_{z,d}^{scriptdata}$
(top-$\delta$ fraction) per class $z$:

$$ s_{z,d}^{PST} = \begin{cases} \gamma \, l_{z,d} & \text{if } l_{z,d} \in \{0, 1\} \\ (1 - \gamma) \, s_{z,d}^{scriptdata} & \text{if among top-}\delta\text{ fraction of predictions for class } z \\ 0 & \text{otherwise.} \end{cases} \qquad (13) $$

$\gamma$ provides a weighting between the true labels and the predicted
labels. In the zero-shot case we only use predictions and $\gamma = 0$.
The parameters $\delta, \gamma \in [0, 1]$ are chosen, similar to the
remaining parameters, on the validation set. For zero-shot we use the
unlabeled training data as additional data for label propagation.

For computing the distance between the sequences we use the feature
representation $g^{seq}(S)$, as for the NN-classifier, which is much
lower dimensional than the raw video feature representation and provides
more reliable distances, as we showed in (Rohrbach et al. 2013a). We
build a k-NN graph by connecting the k closest neighbours. We set the
weights of the graph edges between sequences $d$ and $e$ to
exp(-0.5a°- 5 ||$g^{seq}(S_d)$ - $g^{seq}(S_e)$||), where a is set to
the mean of the distances to the nearest neighbours. We initialize this
graph with the scores $s_{z,d}^{PST}$ and propagate them using label
propagation from Zhou et al. (2004).
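The weighted distance of Equation (12) and the score initialization of Equation (13) can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, unlabeled entries are encoded as `None`, the top-δ fraction is taken over the unlabeled sequences of a class and rounded up, and ties are broken by sort order. The label propagation step itself is not shown.

```python
import numpy as np

def weighted_l2(g_test, g_train, w_z, binarize=True):
    """Equation (12): class-dependent weighted L2 distance.

    w_z holds the script weights of the training sample's class z.
    With binarize=True every non-zero weight becomes 1 before the
    L1 normalization, as described in the text.
    """
    w = np.asarray(w_z, dtype=float)
    if binarize:
        w = (w != 0).astype(float)
    total = w.sum()
    if total > 0:
        w = w / total
    diff = np.asarray(g_test, dtype=float) - np.asarray(g_train, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))

def pst_init(labels, script_scores, gamma, delta):
    """Equation (13) for one class z.

    labels: per-sequence label, 0 or 1, or None when unlabeled.
    script_scores: per-sequence script-data score for class z.
    """
    scores = np.asarray(script_scores, dtype=float)
    out = np.zeros(len(scores))
    unlabeled = [d for d, l in enumerate(labels) if l is None]
    n_top = int(np.ceil(delta * len(unlabeled)))
    top = sorted(unlabeled, key=lambda d: scores[d], reverse=True)[:n_top]
    for d, l in enumerate(labels):
        if l is not None:
            out[d] = gamma * l                     # trusted label
        elif d in top:
            out[d] = (1.0 - gamma) * scores[d]     # reliable prediction
    return out
```

Setting `gamma=0` reproduces the zero-shot case described above, in which only the most confident script-based predictions seed the propagation.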


6.4 Prior knowledge from script data 


We want to quantify what activities and objects typically occur in a
composite activity by leveraging the script data we collected (see
Section 3.4). In order to use prior knowledge from textual script data,
we have to match the (controlled) attribute labels from the video
annotations to the (freely) written script instances (Section 6.4.1).
Based on the matched attributes we compute two different word frequency
statistics (Section 6.4.2).


6.4.1 Label matching

To transfer any kind of knowledge from the script corpus to the
attributes in the video annotation, we need to match attribute labels to
natural language descriptions. The annotated attribute labels are
standard English verbs (for activities, e.g. wash) and nouns (for
participating objects, e.g. carrot), sometimes with additional particles
(take apart and take out). As the script instances contain freely
written natural language sentences, they do not necessarily have any
correspondence with the attribute label annotations. We compare two
strategies for mapping annotations to script data sentences:

literal: we look for an exact match of the attribute label within the
data.

WordNet: we look for attribute labels and their synonyms. We take
synonyms as members of the same synset according to the WordNet ontology
(Fellbaum 1998) and restrict them to words with the same part of speech,
i.e. we match only verbal synonyms to activity predicates and only nouns
to object terms.


6.4.2 Statistics computed on the script data

We compute two different association scores between attribute labels
$a_i$ and composite activities $z$. For this we concatenate all scripts
for a given composite $z$ into a single document $\delta_z$.

freq: word frequency $freq(a_i, \delta_z)$ for each attribute $a_i$ and
composite activity $z$.

tf*idf (term frequency * inverse document frequency, Salton and Buckley
1988) is a measure used in Information Retrieval to determine the
relevance of a word for a document. Given a document collection
$D = \{\delta_1, \dots, \delta_z, \dots, \delta_m\}$, tf*idf for a term
or attribute $a_i$ and a document $\delta_z$ is computed as follows:

$$ tfidf(a_i, \delta_z) = freq(a_i, \delta_z) \cdot \log \frac{|D|}{|\{\delta \in D : a_i \in \delta\}|} \qquad (14) $$

where $\{\delta \in D : a_i \in \delta\}$ is the set of documents
containing $a_i$ at least once. tf*idf represents the distinctiveness of
a term for a document: the value increases if the term occurs often in
the document and rarely in other documents.

We set $w_{z,i} = freq(a_i, \delta_z)$ or
$w_{z,i} = tfidf(a_i, \delta_z)$ and L1-normalize all vectors $w_z$.
These weights $w_{z,i}$ are then used in Equations (10) and (12) and
subsequently also in our PST approach.


6.5 Automatic temporal segmentation


While we assume a segmented video during training time to learn
attribute classifiers as described in Section 5.4, we want to segment
the video automatically at test time. To avoid noisy and small segments
we follow the idea we presented in (Rohrbach et al. 2014), namely we
employ agglomerative clustering. We start with uniform intervals of 60
frames and describe each interval with an attribute-classifier score
vector. We merge neighbouring intervals based on the cosine similarity
of their score vectors and stop when we reach a threshold (found on the
validation set). We aim for a segmentation with a granularity similar to
the original manual annotation. After this a separately trained visual
background classifier removes irrelevant or noisy segments. In our
experiments we show that this leads to composite recognition results
similar to those obtained with the ground truth intervals for the
attributes.
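The merging procedure can be sketched as a greedy loop over neighbouring intervals. The following is a minimal sketch rather than the original implementation: the function names are hypothetical, and representing a merged segment by the length-weighted mean of its score vectors is an assumption, since the text does not specify how merged segments are described. The background classifier is not shown.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def merge_intervals(scores, threshold, interval_len=60):
    """Greedy agglomerative merging of temporally adjacent intervals.

    scores: (m, n) array, one attribute score vector per uniform interval.
    Repeatedly merges the most similar pair of neighbours until no pair
    reaches `threshold`. Returns a list of (start_frame, end_frame).
    """
    scores = np.asarray(scores, dtype=float)
    segs = [(k * interval_len, (k + 1) * interval_len)
            for k in range(len(scores))]
    vecs = [row.copy() for row in scores]
    while len(segs) > 1:
        sims = [cosine(vecs[k], vecs[k + 1]) for k in range(len(segs) - 1)]
        k = int(np.argmax(sims))
        if sims[k] < threshold:
            break
        len_a = segs[k][1] - segs[k][0]
        len_b = segs[k + 1][1] - segs[k + 1][0]
        # Assumed representation of the merged segment (see lead-in).
        vecs[k] = (len_a * vecs[k] + len_b * vecs[k + 1]) / (len_a + len_b)
        segs[k] = (segs[k][0], segs[k + 1][1])
        del segs[k + 1], vecs[k + 1]
    return segs
```

Restricting merges to temporal neighbours guarantees that every resulting segment is a contiguous frame range, which plain clustering of the score vectors would not.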


7 Evaluation 

In this section we evaluate our approaches to fine-grained and composite
activity recognition. We start with fine-grained activity classification
and detection and compare the three types of approaches described in
Section 5, namely pose-based, hand-centric, and holistic approaches.
Next we evaluate our approaches for composite activity recognition
introduced in Section 6, evaluating our attributes enhanced with context
and co-occurrence, the recognition of composite cooking activities using
different levels of supervision, and the zero-shot approach using script
data.


7.1 Experimental Setup 

This section details our experimental setup. We will release evaluation
code to reproduce and compare with our results. See Table 3 for
information on our training/validation/test split. We estimate all
hyperparameters on the validation set and then retrain the models on the
training and validation set with the best parameters.

7.1.1 Experimental setup fine-grained activity 
classification and detection 

In the fine-grained recognition task we want to distinguish 67
fine-grained activities and 155 participating objects (see Table 7 for
the lists of activities and objects). To learn the visual classifiers we
use the annotated ground truth intervals provided with the dataset. We
train one-vs-all SVMs using mean SGD (Rohrbach et al. 2011) with a χ²
kernel approximation (Vedaldi and Zisserman 2010). For detection we use
the midpoint hit criterion to decide on the correctness of a detection,
i.e. the midpoint of the detection has to be within the ground-truth
interval. If a second detection fires for one ground-truth label, it is
counted as a false positive. In the following we report the mean over
the average precision (AP) of each class. Features are combined by
stacking the bag-of-words histograms.
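The midpoint hit criterion and the per-class AP can be illustrated as follows. This is a hedged sketch: the function names are hypothetical, and the non-interpolated AP used here is one common variant, which may differ in detail from the one used in the released evaluation code.

```python
def midpoint_hits(detections, ground_truth):
    """Label score-sorted detections as true or false positives.

    detections: list of (start, end, score); ground_truth: list of
    (start, end). A detection is correct if its midpoint lies inside a
    ground-truth interval that has not been matched before; any further
    detection on an already matched interval counts as false.
    """
    matched = set()
    flags = []
    for start, end, _ in sorted(detections, key=lambda d: d[2], reverse=True):
        mid = 0.5 * (start + end)
        hit = next((k for k, (gs, ge) in enumerate(ground_truth)
                    if gs <= mid <= ge and k not in matched), None)
        if hit is not None:
            matched.add(hit)
        flags.append(hit is not None)
    return flags

def average_precision(flags, n_ground_truth):
    """Sum of precision at each true positive, divided by #ground truth."""
    hits, precision_sum = 0, 0.0
    for rank, is_tp in enumerate(flags, start=1):
        if is_tp:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_ground_truth if n_ground_truth else 0.0
```

The reported number is then the mean of `average_precision` over all activity or object classes.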


7.1.2 Experimental setup composite activity recognition 


For localizing attributes within composite activities we rely on our
automatic segmentation (Section 6.5). We aim to recognize 31 composite
activities (see bold names in Table 2).

We distinguish two cases for training the attributes with respect to
composites.

Attribute training on all composites. We use all available 218
training+validation videos for training the attribute classifiers. See
left half of Tables 8, 9, and 10.

Attribute training on disjoint composites. We use all available videos
apart from those showing the test composite categories (in total 92
videos). This means that attributes and composites are trained on
disjoint sets of composite categories and thus also on disjoint sets of
videos. This tests how well novel composite categories can be recognized
without additional attribute labels. See right half of Tables 8, 9, and
10.

Next, we have two cases for training the composites.

With training data for composites. We train on the 126
training+validation videos whose category is in the set of the 31 test
categories. Note that in the case of Attribute training on all
composites the training videos are also part of the attribute training.
See top part of Table 9.

No training data for composites. Here we do not rely on any training
labels for the composite activities. See bottom part of Table 9 and all
of Table 10. Combined with Attribute training on disjoint composites
this is zero-shot recognition.


7.2 Fine-grained activity classification and detection 

Activity classification. We start with the classification results on
fine-grained activities and their participants (Table 5).


Approach                             Activities   Objects   All
Pose-based approaches
(1) BM                                  18.9        13.8    15.7
(2) FFT                                 19.0        16.2    17.2
(3) Combined                            24.1        19.0    20.8
Hand-centric approaches
(4) Hand-cSift                          23.0        23.8    23.5
(5) Hand-Trajectories                   45.1        31.5    36.4
(6) Combined                            43.5        34.2    37.5
Holistic approach
(7) Dense Trajectories                  44.5        31.3    36.1
Combinations
(8) Dense Traj,BM,FFT                   43.1        30.7    35.2
(9) Dense Traj,Hand-Traj                52.2        37.7    42.9
(10) Dense Traj,Hand-Traj,-cSift        51.2        39.3    43.7

Table 5 Fine-grained activity and object classification results, mean AP
in % (see Section 7.2 for discussion).


The body model features on the joint tracks (BM) achieve a mean average
precision (AP) of 18.9% for activities and 13.8% for objects. Comparing
this to the FFT features, we observe that FFT performs slightly better,
improving the AP over BM by 0.1% and 2.4%, respectively. The combination
of BM and FFT features (line 3 in Table 5) yields a significant
improvement, reaching an AP of 24.1% for activities and 19.0% for
objects. We attribute this to the complementary information encoded in
the features: while BM encodes, among others, velocity histograms of the
joint tracks and statistics between tracks of different joints, FFT
features encode FFT coefficients of individual joints. Still, this is a
relatively low performance. It can be explained, on the one hand, by
failures of the pose estimation method and, on the other hand, by the
fact that the pose-based features might not contain enough information
to successfully distinguish the challenging fine-grained activities and
participating objects.

Next we look at the performance of our proposed hand-centric features.
Color Sift features, densely sampled in the hand neighborhood, allow us
to improve the object recognition AP to 23.8% (Hand-cSift), indicating
their better suitability in particular for recognizing objects. Dense
Trajectories features computed around hands (denoted as
Hand-Trajectories) reach 45.1% and 31.5% recognition AP for activities
and objects, respectively. Combining both features leads to a small
decrease for activities (43.5%), but it further improves the object
recognition performance to 34.2%. Overall our hand-centric approach
reaches a recognition AP of 37.5% for activities and objects together.

The state-of-the-art holistic approach of Dense Trajectories (Wang et
al. 2013a) obtains 44.5% and 31.3% recognition AP for activities and
objects. Compared to our hand-centric features, this is slightly below
the Hand-Trajectories, which are restricted to the areas around hands.
This supports our hypothesis that the most relevant information for
recognizing our fine-grained activities is contained in the hand
regions.

We also consider several feature combinations (lines 8, 9, 10 in Table
5). Combining Dense Trajectories with the pose-based features does not
improve the recognition performance. However, combining them with
Hand-Trajectories improves the activity recognition by 7.7% and object
recognition by 6.4% (line 7 vs 9 in Table 5). Finally, adding the
Hand-cSift features allows us to reach 43.7% recognition AP for
activities and objects together.

The detailed comparison of Dense Trajectories, Hand-Trajectories and the
final feature combination (line 10 in Table 5) can be found in Table 7.
Hand-Trajectories lose to Dense Trajectories on activities that include
"coarser" motion, e.g. push down, hang or plug, and on corresponding
objects such as hook or teapot. Note that Hand-Trajectories outperform
the Dense Trajectories for 35 activity classes, while the opposite holds
only 25 times (for objects, 65 vs 43 times, respectively). This shows
again that the hand-centric features outperform the holistic features on
the majority of classes in both tasks. Some example cases where the
hand-centric approach is significantly better are activities such as rip
open, take apart, and grate and objects such as cauliflower, oven, and
cup. At the same time the final feature combination (line 10 in Table 5)
outperforms both aforementioned features in about 60% of cases. We
demonstrate some qualitative results comparing Dense Trajectories to the
final feature combination in Table El. We also looked closer at the
performance of other features: e.g. the combined pose features (line 3
in Table 5) perform well on "coarser", full-body activities, such as
throw in garbage, take out, move, while performing rather poorly on more
fine-grained activities. On the other hand the Hand-cSift features are
good at recognizing objects with distinct shapes/colors, e.g. pineapple,
carrot, bowl, etc.


Activity detection. Next we look at the detection performance (Table 6),
which is inherently more challenging than the classification task. Here
the BM features reach 8.3% overall AP and FFT reaches 9.3%. Their
combination (line 3 in Table 6) gets 11.4% overall AP, while Hand-cSift
only reaches 10.7%. Hand-Trajectories alone get 16.6% AP and combined
with Hand-cSift they reach 22.5%, while the Dense Trajectories get 24.4%
AP. As we can see, for this task our hand-centric features perform worse
than the holistic features, and Hand-cSift even performs worse than the
combined pose-based features (line 3 vs 4 in Table 6). We believe the
reason for this is that for correct segmentation of the video into
activity intervals we need more holistic information, which the
hand-centric features cannot provide, while pose-based and holistic
features can capture it better. Similarly, when combining Dense
Trajectories with the pose-based features (line 8 in Table 6) we observe
a small improvement, supporting our hypothesis that pose indeed helps to
capture the detection boundaries. On the other hand, combining Dense
Trajectories with our hand-centric features significantly improves the
performance, in particular by 4.7% for activities and by 3.7% for
objects (line 7 vs 9 in Table 6). Adding the Hand-cSift features further
improves the results and we reach 28.6% overall AP. The improvement
obtained after combining holistic and hand-centric features can be
explained by the increased classification AP within the obtained
intervals. We thus conclude that for activity detection we require
holistic information, which can come e.g. from the human pose. Combining
the holistic and hand-centric features is still beneficial and
significantly improves the performance.


Approach                             Activities   Objects   All
Pose-based approaches
(1) BM                                   9.7         7.6     8.3
(2) FFT                                 10.5         8.7     9.3
(3) Combined                            14.3         9.8    11.4
Hand-centric approaches
(4) Hand-cSift                          10.5        10.9    10.7
(5) Hand-Trajectories                   21.3        14.0    16.6
(6) Combined                            26.0        20.6    22.5
Holistic approach
(7) Dense Trajectories                  29.5        21.5    24.4
Combinations
(8) Dense Traj,BM,FFT                   30.7        21.5    24.8
(9) Dense Traj,Hand-Traj                34.3        25.2    28.5
(10) Dense Traj,Hand-Traj,-cSift        34.5        25.3    28.6

Table 6 Fine-grained activity and object detection results, mean AP in %
(see Section 7.2 for discussion).


7.3 Context and co-occurrence for fine-grained 
activities 


While so far we looked at individual fine-grained activities, we
now evaluate the benefit from co-occurrence and context as
introduced in Section 6.1. Table 8 provides the results for
recognizing activities and their participants, modeled as
attributes. We evaluate in two settings. The left two columns of
Table 8 show the results for training on all composites in the
training set, while the right two columns are trained only on
composites absent from the test set (Disjoint Composites), i.e. the
second is a more challenging problem, as there is less training






















Activity             Dense  Hand   Combi  | Object                    Dense  Hand   Combi  | Object             Dense  Hand   Combi
                     Traj   Traj   +cSift |                           Traj   Traj   +cSift |                    Traj   Traj   +cSift
add                  19.8   16.3   24.0   | apple                     -      -      -      | mango              3.8    7.0    2.5
arrange              61.9   32.1   33.8   | arils                     19.8   57.8   12.5   | masher             -      -      -
change temperature   69.1   78.1   75.4   | asparagus                 -      -      -      | measuring-pitcher  0.7    5.0    5.3
chop                 36.6   35.4   48.3   | avocado                   2.5    4.3    3.8    | measuring-spoon    34.1   12.6   7.3
clean                32.0   33.0   33.3   | bag                       -      -      -      | milk               0.4    0.4    0.4
close                76.3   68.8   77.0   | baking-paper              -      -      -      | mortar             -      -      -
cut apart            33.8   36.2   33.5   | baking-tray               -      -      -      | mushroom           -      -      -
cut dice             39.3   45.7   44.9   | blender                   -      -      -      | net-bag            0.3    0.2    0.7
cut off ends         21.4   52.0   31.9   | bottle                    57.1   49.3   57.7   | oil                52.3   47.6   55.6
cut out inside       2.2    0.8    2.0    | bowl                      34.7   33.1   49.0   | onion              19.3   20.4   22.7
cut stripes          12.9   13.0   15.4   | box-grater                -      -      -      | orange             18.4   11.1   19.3
cut                  28.3   44.9   27.2   | bread                     3.7    6.5    8.9    | oregano            -      -      -
dry                  81.9   85.1   84.5   | bread-knife               3.0    4.0    8.1    | oven               30.7   73.4   89.3
enter                100.0  100.0  100.0  | broccoli                  2.0    2.3    5.7    | paper              -      -      -
fin                  94.3   90.8   86.2   | bun                       1.2    2.3    8.5    | paper-bag          20.5   10.3   33.0
gather               25.7   23.8   35.7   | bundle                    0.5    1.1    1.4    | paper-box          1.0    1.2    3.6
grate                66.7   100.0  100.0  | butter                    6.2    1.9    9.6    | parsley            23.4   25.5   49.6
hang                 85.8   57.2   81.4   | carafe                    44.4   46.7   54.4   | pasta              26.1   16.0   40.7
mix                  10.3   5.4    52.9   | carrot                    26.5   41.3   64.9   | peach              -      -      -
move                 75.7   75.7   78.3   | cauliflower               29.3   68.9   73.8   | pear               -      -      -
open close           60.8   65.7   64.7   | cheese                    -      -      -      | peel               40.3   28.6   35.2
open egg             50.0   28.1   39.2   | chefs-knife               59.9   73.3   63.1   | pepper             3.1    14.4   6.7
open tin             -      -      -      | chili                     0.6    0.9    1.3    | peppercorn         -      -      -
open                 22.0   22.0   34.5   | chive                     -      -      -      | pestle             -      -      -
package              0.4    1.6    1.8    | chocolate                 -      -      -      | Philadelphia       -      -      -
peel                 55.0   67.2   58.6   | coffee                    3.3    25.0   100.0  | pineapple          19.5   47.0   49.7
plug                 41.6   32.6   81.0   | coffee-container          34.6   24.8   73.4   | plastic-bag        36.4   37.7   43.6
pour                 44.8   44.9   45.1   | coffee-machine            34.7   65.1   91.2   | plastic-bottle     4.7    2.8    9.1
pull apart           38.7   53.8   45.2   | coffee-powder             0.5    1.3    3.0    | plastic-box        2.6    9.0    5.3
pull up              79.2   21.7   75.6   | colander                  63.4   62.2   77.9   | plastic-paper-bag  0.9    14.7   19.6
pull                 1.3    9.1    1.2    | cooking-spoon             -      -      -      | plate              65.7   69.2   73.9
puree                -      -      -      | corn                      -      -      -      | plum               0.7    2.5    1.3
purge                0.1    0.1    0.6    | counter                   71.8   70.3   76.5   | pomegranate        5.1    0.8    2.3
push down            30.7   7.6    28.0   | cream                     0.9    0.5    1.4    | pot                84.3   88.0   91.1
put in               55.5   50.8   58.0   | cucumber                  4.3    5.2    4.1    | potato             0.4    0.4    0.6
put lid              87.3   85.3   90.0   | cup                       27.0   26.7   43.6   | puree              -      -      -
put on               6.2    5.6    1.2    | cupboard                  97.5   98.0   98.4   | raspberries        -      -      -
read                 5.1    5.4    5.6    | cutting-board             84.4   85.4   88.9   | salad              -      -      -
remove from package  19.3   34.3   31.5   | dough                     -      -      -      | salami             -      -      -
rip open             2.8    45.0   100.0  | drawer                    98.2   98.4   98.5   | salt               59.8   48.7   64.1
scratch off          30.7   33.1   31.9   | egg                       12.1   3.6    7.3    | seed               -      -      -
screw close          77.3   77.5   77.5   | eggshell                  3.5    3.6    11.2   | side-peeler        50.0   11.7   37.8
screw open           78.7   69.4   79.2   | electricity-column        89.3   82.3   98.1   | sink               47.0   54.0   53.9
shake                73.0   75.7   77.3   | electricity-plug          74.3   70.6   87.7   | soup               -      -      -
shape                -      -      -      | fig                       1.0    1.0    0.9    | spatula            72.9   76.2   78.2
slice                47.2   71.3   57.4   | filter-basket             1.3    3.4    13.1   | spice              19.1   13.3   12.4
smell                49.7   15.7   33.0   | finger                    18.4   15.4   8.8    | spice-holder       95.6   94.4   96.3
spice                88.6   89.0   89.2   | flat-grater               31.7   27.7   40.9   | spice-shaker       88.3   87.3   91.5
spread               87.1   77.1   96.7   | flower-pot                -      -      -      | spinach            -      -      -
squeeze              90.1   92.9   91.9   | food                      -      -      -      | sponge             17.2   45.4   38.2
stamp                -      -      -      | fork                      8.7    7.5    10.5   | sponge-cloth       67.1   68.1   75.0
stir                 91.2   81.9   91.7   | fridge                    100.0  99.8   100.0  | spoon              2.8    5.9    8.9
strew                1.7    2.4    2.4    | front-peeler              21.8   6.0    17.6   | squeezer           52.5   67.0   59.3
take apart           1.6    32.1   53.3   | frying-pan                88.7   91.9   93.6   | stone              0.2    0.7    0.7
take lid             66.2   76.8   71.7   | garbage                   13.7   17.9   27.5   | stove              84.4   87.2   90.4
take out             94.1   93.9   95.1   | garlic-bulb               0.3    0.6    0.8    | sugar              22.0   24.2   29.0
tap                  3.3    4.2    6.2    | garlic-clove              11.7   3.6    9.3    | table-knife        -      -      -
taste                9.4    21.0   22.0   | ginger                    1.9    3.3    3.6    | tap                70.2   71.8   79.1
test temperature     11.3   11.8   35.1   | glass                     2.6    4.5    21.6   | tea-egg            37.2   28.7   36.1
throw in garbage     96.7   96.0   97.1   | green-beans               21.1   24.6   23.2   | tea-herbs          60.5   55.6   91.1
turn off             7.4    21.1   33.0   | ham                       -      -      -      | teapot             46.4   6.7    69.1
turn on              27.8   30.6   48.5   | hand                      95.9   95.2   96.4   | teaspoon           29.2   32.4   36.5
turn over            -      -      -      | handle                    100.0  9.1    100.0  | tin                -      -      -
unplug               8.7    3.8    20.0   | hook                      95.6   71.2   98.3   | tin-opener         -      -      -
wash                 93.4   93.9   93.7   | hot-chocolate-powder-bag  -      -      -      | tissue             -      -      -
whip                 -      -      -      | hot-dog                   2.1    2.7    8.8    | toaster            1.3    8.1    6.7
wring out            3.3    4.5    5.3    | jar                       5.4    14.2   17.8   | tomato             -      -      -
                                          | ketchup                   2.0    3.1    19.6   | tongs              -      -      -
                                          | kettle-power-base         14.4   9.8    41.4   | top                -      -      -
                                          | kiwi                      1.1    2.9    1.5    | towel              73.2   76.9   79.2
                                          | knife                     69.6   83.5   76.8   | tube               1.0    9.5    10.2
                                          | knife-sharpener           -      -      -      | water              55.0   46.9   57.2
                                          | kohlrabi                  -      -      -      | water-kettle       40.7   25.9   53.7
                                          | ladle                     -      -      -      | wire-whisk         -      -      -
                                          | leek                      10.6   19.5   17.6   | wrapping-paper     2.9    0.4    2.0
                                          | lemon                     -      -      -      | yolk               0.5    0.5    0.3
                                          | lid                       67.1   70.8   71.8   | zucchini           -      -      -
                                          | lime                      14.2   3.7    14.6   |

Table 7 Fine-grained activities and object classification performance of Dense Trajectories, Hand Trajectories, and their
combination including Hand-cSift (line 10 in Table 5) for 67 fine-grained activities and 155 participating objects. AP in %.
"-" denotes that the category is not part of the test set and not evaluated.
Attribute training on:        All Composites            Disjoint Composites
                              Dense Traj  Combi+cSift   Dense Traj  Combi+cSift
(1) Base (s_base)             36.1        43.7          33.5        35.9
(2) Context only (s_con)      11.1        12.6          6.8         8.1
(3) Base+Context              37.8        41.2          28.3        32.3
(4) Co-occ. only (s_coocc)    38.1        41.7          32.6        35.3
(5) Base+Co-occ.              38.1        41.4          32.7        35.2
(6) Base+Cont.+Co-occ.        39.3        41.5          30.8        32.6

Table 8 Attribute recognition using context and co-occurrence,
mean AP in %. Combi+cSift refers to Dense Traj,Hand-Traj,-cSift,
see Section 7.3 for discussion.

data and the attributes are tested in a different context.
The performance in the first line is equivalent to the results in
Table 5. The leftmost column shows results for Dense Trajectories.
More specifically, when using only temporal context to recognize
activity attributes, performance drops from 36.1% AP for the base
classifier to 11.1% AP. This is the expected result, because the
context is similar for all activities of the same sequence and thus
cannot discriminate attributes. In contrast, when using
co-occurrence only (line 4 in Table 8), the performance increases by
2.0% compared to the base classifiers due to the high relatedness
between the attributes, namely between activities and their
participants. Combining context and co-occurrence information with
the base classifier gives 37.8% and 38.1%, respectively. A
combination of all training modes achieves a performance of 39.3%
AP, improving the base classifier's result by 3.2%. While results
for Dense Trajectories are as expected, i.e. adding context and
co-occurrence improves performance, the performance drops slightly
for the (in general) better performing combined features (second
column). However, although the attribute prediction performance
drops, we found that for recognizing the composites, context and
co-occurrence are still useful.
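The gains and drops discussed above can be read off Table 8 directly. The following sketch simply recomputes the difference of every variant to the base classifier, per column, from the values printed in Table 8; the dictionary layout, the column order, and the function name `gains_over_base` are illustrative choices, not part of the original evaluation code.

```python
# Mean AP in % copied from Table 8. Column order (an illustrative choice):
# All/Dense Traj, All/Combi+cSift, Disjoint/Dense Traj, Disjoint/Combi+cSift
TABLE8 = {
    "Base": (36.1, 43.7, 33.5, 35.9),
    "Context only": (11.1, 12.6, 6.8, 8.1),
    "Base+Context": (37.8, 41.2, 28.3, 32.3),
    "Co-occ. only": (38.1, 41.7, 32.6, 35.3),
    "Base+Co-occ.": (38.1, 41.4, 32.7, 35.2),
    "Base+Cont.+Co-occ.": (39.3, 41.5, 30.8, 32.6),
}


def gains_over_base(table):
    """Difference of each variant to the base classifier, per column."""
    base = table["Base"]
    return {
        name: tuple(round(value - b, 1) for value, b in zip(row, base))
        for name, row in table.items()
        if name != "Base"
    }


if __name__ == "__main__":
    for name, deltas in gains_over_base(TABLE8).items():
        print(f"{name:<20}", deltas)
```

Under this reading, the first column (Dense Trajectories, all composites) shows the gains of 2.0 and 3.2 points quoted in the text, while the second column (combined features) shows negative differences for all variants.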

In the second setting, we restrict the training dataset to
composites absent from the test set (right two columns of Table 8),
requiring the activity attributes to transfer to different composite
activities. When comparing the right two columns to the left two
columns, we notice a significant performance drop for all classifiers
and both features. This decrease can mainly be attributed to the
strong reduction of training data to about one third. The base
classifier performs best and the co-occurrence variants perform
slightly below it. Variants including context lead to tremendous
performance drops in all combinations because the activity context
changes from training to test (having different composite
activities).
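The Disjoint Composites setting amounts to a filter on the attribute training videos. The sketch below illustrates the idea under the assumption that each video is a record with a `composite` label; the record layout, the video identifiers, and the function name `disjoint_attribute_training_set` are hypothetical.

```python
from typing import Dict, Iterable, List


def disjoint_attribute_training_set(
    training_videos: Iterable[Dict[str, str]],
    test_composites: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep only training videos whose composite never occurs at test time."""
    held_out = set(test_composites)
    return [v for v in training_videos if v["composite"] not in held_out]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    videos = [
        {"id": "v01", "composite": "preparing carrot"},
        {"id": "v02", "composite": "preparing onion"},
        {"id": "v03", "composite": "making coffee"},
    ]
    kept = disjoint_attribute_training_set(videos, ["preparing onion"])
    print([v["id"] for v in kept])
```

Attribute classifiers trained on the filtered set therefore never see the test composites, which is why their context statistics no longer match at test time.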


Attribute training on:                   All Composites            Disjoint Composites
                                         Dense Traj  Combi+cSift   Dense Traj  Combi+cSift
With training data for composites
  Without attributes
    (1) SVM                              39.8        41.1          -           -
  Attributes on gt intervals
    (2) SVM                              43.6        52.3          32.3        34.9
  Attributes on automatic segmentation
    (3) SVM                              49.0        56.9          35.7        34.8
    (4) NN                               42.1        43.3          24.7        32.7
    (5) NN+Script data                   35.0        40.4          18.0        21.9
    (6) PST+Script data                  54.5        57.4          32.2        32.5
No training data for composites
  Attributes on automatic segmentation
    (7) Script data                      36.7        29.9          19.6        21.9
    (8) PST + Script data                36.6        43.8          21.1        19.3

Table 9 Composite cooking activity classification, mean AP
in %. Top left quarter: fully supervised, right column: reduced
attribute training data, bottom section: no composite cooking
activity training data, right bottom quarter: true zero shot.
See Section 7.4 for discussion.

7.4 Composite cooking activity classification 

After evaluating attribute recognition performance in Section 7.3,
we now show the results for recognizing composites as introduced in
Section 6.2. From the different attribute combination variants we
only use the combination of base, context, and co-occurrence (last
line in Table 8). Although this is not always the best choice for
recognizing attributes, we found it to work better than or similar
to the alternatives for composite recognition. The results are shown
in Table 9, which, similar to Table 8, shows results for training
the attributes on all composites on the left, and reduced attribute
training on non-test composites on the right. In the top section of
the table we use training data for the composite cooking activities.
In the bottom section of the table we use no training data for the
composite cooking activities. This is enabled by the use of script
data as motivated before. Disregarding the first line, which does
not use attributes at all, and the second line, which uses ground
truth intervals for attributes, all other lines are based on
attributes computed on our automatic temporal segmentation,
introduced in Section 6.5.

Examining the results in Table 9 we make several interesting
observations. First, training composites on attributes of
fine-grained activities and objects (line 3 in Table 9) outperforms
low-level features (line 1 in Table 9), supporting our claim that
for learning composite activities it is important to share
information on an intermediate level of attributes.
The second, somewhat surprising observation is that recognizing
composites based on our segmentation (line 3 in Table 9) outperforms
using ground truth segments (line 2 in Table 9). We attribute this
to the fact that our segmentation is coarser than the ground truth
and that we additionally remove noisy and background segments with a
background classifier. This leads to more robust attributes and
consequently better composite recognition. It also makes it possible
to use separate training sets for composites and attributes. This
setting is explored in the top right quarter of Table 9. Here the
training sequences for attributes are disjoint from the ones for
composites, i.e. we do not require the attribute annotations for the
composite training set.

Third, the improvements we achieved for fine-grained activities and
object recognition by combining hand-centric with holistic features
are still evident for composites. The combination of Dense
Trajectories, Hand-Trajectories, and Hand-cSift (2nd, 4th column)
outperforms in most cases Dense Trajectories only (1st, 3rd column),
most notably in the setting “All Composites” for SVM (56.9% over
49.0% AP) and PST+Script data (43.8% over 36.6% AP).

Fourth, using our Propagated Semantic Transfer (PST) approach is in
most cases superior to other variants of incorporating script data
(NN+Script data / Script data). Most notably it reaches 57.5% AP for
our combined feature. This is the overall best performance and also
outperforms the SVM with 56.9% AP. PST slightly drops for the last
number in the table (19.3%), which we found is due to rather
suboptimal parameters selected on the validation set. We note that
in the scenario of Disjoint Composites (top right quarter of
Table 9) PST+Script data is outperformed by training an SVM. We
attribute this to the fact that the attributes are less robust in
this scenario (see Table 8) and the SVM can better adjust to that by
learning which attributes are reliable and which are not. NN and PST
are based on distances between attribute score vectors, thus metric
learning could be beneficial in these cases.
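To make the remark about distances concrete, the following is a minimal sketch of nearest-neighbour composite classification on attribute score vectors. It assumes a Euclidean distance with optional per-attribute weights; the weights stand in for what a learned (diagonal) metric could provide and are not part of the approach evaluated above. All names and the toy vectors are hypothetical.

```python
import math
from typing import Optional, Sequence


def weighted_distance(
    a: Sequence[float],
    b: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Euclidean distance; per-attribute weights emulate a diagonal metric."""
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def nearest_neighbour(
    query: Sequence[float],
    train_vectors: Sequence[Sequence[float]],
    train_labels: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> str:
    """Return the composite label of the closest training score vector."""
    best = min(
        range(len(train_vectors)),
        key=lambda i: weighted_distance(query, train_vectors[i], weights),
    )
    return train_labels[best]


if __name__ == "__main__":
    # Hypothetical attribute order: (carrot, onion, peel)
    train = [[0.9, 0.1, 0.80], [0.1, 0.9, 0.70]]
    labels = ["preparing carrot", "preparing onion"]
    print(nearest_neighbour([0.8, 0.2, 0.72], train, labels))
```

With uniform weights every attribute contributes equally, so unreliable attributes can dominate the distance; down-weighting them is one simple way a learned metric could help.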

Fifth, script data not only allows us to achieve the maximum
performance but also allows transfer (bottom part of Table 9),
achieving in some cases results close to supervised approaches. The
bottom right part of the table shows zero-shot recognition. Although
here the performance cannot compete with the supervised setting, we
would like to point out that this is a very challenging scenario,
where attributes are trained on different composites, without
composite training data, and the video stream has to be segmented
automatically.
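As a rough illustration of how composites can be recognized without composite training data, the sketch below scores each composite as a weighted sum of attribute classifier scores, with the weights taken from script data. This weighted-sum form is an assumption made for illustration; the actual formulation is the one introduced in Section 6. The function names and toy values are hypothetical.

```python
from typing import Dict


def script_based_scores(
    attribute_scores: Dict[str, float],
    script_weights: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Score each composite as a script-weighted sum of attribute scores."""
    return {
        composite: sum(w * attribute_scores.get(attr, 0.0) for attr, w in weights.items())
        for composite, weights in script_weights.items()
    }


def predict_composite(
    attribute_scores: Dict[str, float],
    script_weights: Dict[str, Dict[str, float]],
) -> str:
    """Pick the composite with the highest script-based score."""
    scores = script_based_scores(attribute_scores, script_weights)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical weights mined from scripts and hypothetical video scores.
    weights = {
        "preparing carrot": {"carrot": 0.5, "peel": 0.2},
        "preparing onion": {"onion": 0.5, "peel": 0.2},
    }
    video = {"carrot": 0.8, "peel": 0.6, "onion": 0.1}
    print(predict_composite(video, weights))
```

No labeled video of the target composite is needed in this sketch: only attribute classifiers and the script-derived weights enter the prediction.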

Sixth, while in Table 9 we always used the variant tf*idf-WN for
Script data, we show different variants of Script data for the case
where they are not combined with NN or PST in Table 10. The main
observation is that freq-WN performs worst in all cases, most likely
because the WordNet expansions make the results noisier. While in
the first column tf*idf-WN works best, there is overall no clear
winner. However, when incorporated in PST, it is more important to
select appropriate parameters for PST on the validation set than to
select the right variant of Script data.

Attribute training on:            All Composites            Disjoint Composites
                                  Dense Traj  Combi+cSift   Dense Traj  Combi+cSift
No training data for composites
  Script data
    (1) freq-literal              28.2        30.5          19.8        24.1
    (2) freq-WN                   25.3        28.6          17.4        20.3
    (3) tf*idf-literal            35.9        31.8          20.0        23.6
    (4) tf*idf-WN                 36.7        29.9          19.6        21.9

Table 10 Variants of script knowledge, AP in %.
Combi+cSift refers to Dense Traj,Hand-Traj,-cSift. See
Section 7.4 for discussion.
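For readers unfamiliar with the weighting behind the tf*idf variants, the sketch below shows a standard tf*idf computation over tokenized script descriptions, where each composite is treated as one document and only literal matches of attribute names are counted. This is the textbook formulation and may differ in detail from the variants evaluated above; in particular, no WordNet expansion is performed. The function name and the toy scripts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_weights(
    scripts: Dict[str, List[List[str]]],
    attributes: Sequence[str],
) -> Dict[str, Dict[str, float]]:
    """Weight of each attribute for each composite (one document per composite)."""
    # Term frequency: share of attribute mentions within a composite's scripts.
    tf: Dict[str, Dict[str, float]] = {}
    for composite, descriptions in scripts.items():
        counts = Counter(t for d in descriptions for t in d if t in attributes)
        total = sum(counts.values())
        tf[composite] = {a: (counts[a] / total if total else 0.0) for a in attributes}

    # Inverse document frequency: attributes mentioned for few composites
    # are more discriminative than attributes mentioned for all of them.
    n = len(scripts)
    weights: Dict[str, Dict[str, float]] = {c: {} for c in scripts}
    for a in attributes:
        df = sum(1 for c in scripts if tf[c][a] > 0)
        idf = math.log(n / df) if df else 0.0
        for c in scripts:
            weights[c][a] = tf[c][a] * idf
    return weights


if __name__ == "__main__":
    toy_scripts = {
        "preparing carrot": [["peel", "carrot", "cutting-board"]],
        "preparing onion": [["peel", "onion", "cutting-board"]],
    }
    print(tfidf_weights(toy_scripts, ["peel", "carrot", "onion", "cutting-board"]))
```

A plain frequency variant corresponds to using the term frequency alone, which keeps weight on attributes such as cutting-board that occur for almost every composite.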

Last, we want to look at an interesting comparison of the first line
(SVM without attributes) versus line 8 (PST + Script data), which
effectively compares the settings “only composite labels” versus
“only attribute labels” (+ Script data). Although the latter does
not have any labels for the actual task of composite recognition, it
either performs close (in the case of Dense Trajectories) or
slightly better (for combined features). This indicates that our
PST + Script data approach is very good at transferring information
from the original task it was trained on to another, which is very
important for adaptation to novel situations, typical for assisted
daily living scenarios.

Table 11 provides qualitative results for three composite videos
including how they are decomposed into attributes of fine-grained
activities and participating objects.

Video 1
  Composite
    Ground-truth:                 Preparing cauliflower
    Dense Traj:                   Preparing orange
    Dense Traj,Hand-Traj,-cSift:  Preparing cauliflower
  Segment
    Ground-truth:                 cauliflower, cutting-board, hand, pull apart(A)
    Dense Traj:                   hand, cutting-board, pull apart(A), onion, peel, cut apart(A)
    Dense Traj,Hand-Traj,-cSift:  hand, cutting-board, cut apart(A), cauliflower, onion, pull apart(A)
  Segment
    Ground-truth:                 cauliflower, cut(A), cutting-board, knife
    Dense Traj:                   knife, cutting-board, cut apart(A), counter, chefs-knife, cut(A)
    Dense Traj,Hand-Traj,-cSift:  cauliflower, cut apart(A), knife, chefs-knife, cutting-board, cut(A)
  Segment
    Ground-truth:                 add(A), cauliflower, colander, cutting-board, hand
    Dense Traj:                   hand, cutting-board, move(A), counter, bowl, colander
    Dense Traj,Hand-Traj,-cSift:  hand, cutting-board, move(A), counter, cauliflower, colander
  Segment
    Ground-truth:                 cauliflower, colander, hand, wash(A)
    Dense Traj:                   hand, wash(A), plate, colander, onion, peel
    Dense Traj,Hand-Traj,-cSift:  hand, wash(A), bowl, colander, cauliflower, onion

Video 2
  Composite
    Ground-truth:                 Preparing carrot
    Dense Traj:                   Preparing cucumber
    Dense Traj,Hand-Traj,-cSift:  Preparing carrot
  Segment
    Ground-truth:                 carrot, chefs-knife, cut off ends(A), cutting-board
    Dense Traj:                   cutting-board, cut apart(A), chefs-knife, cut off ends(A), knife, put on(A)
    Dense Traj,Hand-Traj,-cSift:  cutting-board, cut off ends(A), chefs-knife, cut apart(A), knife, carrot
  Segment
    Ground-truth:                 carrot, front-peeler, peel(A)
    Dense Traj:                   cutting-board, peel(A), front-peeler, chefs-knife, knife, cucumber
    Dense Traj,Hand-Traj,-cSift:  cutting-board, peel(A), carrot, chefs-knife, front-peeler, cucumber
  Segment
    Ground-truth:                 carrot, chefs-knife, cut stripes(A), cutting-board
    Dense Traj:                   cutting-board, chefs-knife, slice(A), knife, cut apart(A), cucumber
    Dense Traj,Hand-Traj,-cSift:  cutting-board, chefs-knife, slice(A), knife, carrot, cut apart(A)
  Segment
    Ground-truth:                 carrot, chefs-knife, cut apart(A), cutting-board
    Dense Traj:                   cutting-board, cut apart(A), chefs-knife, knife, cauliflower, cut off ends(A)
    Dense Traj,Hand-Traj,-cSift:  cutting-board, cut apart(A), chefs-knife, cut off ends(A), knife, carrot

Video 3
  Composite
    Ground-truth:                 Preparing onion
    Dense Traj:                   Preparing onion
    Dense Traj,Hand-Traj,-cSift:  Preparing onion
  Segment
    Ground-truth:                 knife, onion, peel(A)
    Dense Traj:                   peel(A), hand, onion, throw in garbage(A), bowl, front-peeler
    Dense Traj,Hand-Traj,-cSift:  peel(A), hand, throw in garbage(A), onion, knife, peel
  Segment
    Ground-truth:                 chop(A), cutting-board, knife, onion
    Dense Traj:                   cutting-board, knife, cut dice(A), onion, chop(A), slice(A)
    Dense Traj,Hand-Traj,-cSift:  cutting-board, knife, cut dice(A), slice(A), chop(A), chive
  Segment
    Ground-truth:                 add(A), cutting-board, frying-pan, knife, onion
    Dense Traj:                   hand, frying-pan, cutting-board, pot, spatula, add(A)
    Dense Traj,Hand-Traj,-cSift:  hand, frying-pan, add(A), pot, spatula, cauliflower
  Segment
    Ground-truth:                 frying-pan, onion, spatula, stir(A)
    Dense Traj:                   spatula, frying-pan, stir(A), onion, add(A), egg
    Dense Traj,Hand-Traj,-cSift:  frying-pan, spatula, stir(A), onion, add(A), broccoli

Table 11 Qualitative results for Dense Trajectories and its combination with hand-centric features (line 10 in Table 5) with
respect to ground-truth. Top-6 highest scoring attributes (activities and objects) are shown, where (A) denotes activities.
Composite activity predictions shown on the right. Correct results marked with bold. Note that many attributes are not
correct according to the ground truth but very similar, e.g. we predict slice instead of cut stripes.


8 Conclusion

In this work we address two challenges that have not been widely
explored so far, namely fine-grained activity recognition and
composite activity recognition. In order to approach these tasks we
propose the large activity database MPII Cooking 2. We recorded and
annotated 273 videos of more than 27 hours with 30 human subjects
performing a large number of realistic cooking activities. Our
database is unique with respect to size, length, complexity of the
videos, and available annotations (activities, objects, human pose,
text descriptions).

To estimate the complexity of fine-grained activity recognition in
our database we compare three types of approaches: pose-based,
hand-centric, and holistic. We evaluate on a classification and the
often neglected detection task. Our results show that for
recognizing fine-grained activities and their participating objects
it is beneficial to focus on hand regions as the activities are
hand-centric and the relevant objects are in the hand neighbourhood.

Composite activities are difficult to recognize because of their
inherent variability and the lack of training data for specific
composites. We show that attribute-based activity recognition allows
recognizing composite activities well. Most notably, we describe how
textual script data, which is easy to collect, enables an
improvement of the composite activity recognition when only little
training data is available, and even allows for complete zero-shot
transfer.

As part of future work we plan to validate our hand-centric
approach in other domains and exploit the scripts for composite
activity recognition by modeling the temporal structure of the
video.